Abstract: We develop and analyze methods for computing
provably optimal maximum a posteriori (MAP)
configurations for a subclass of Markov random fields 
defined on graphs with cycles. By decomposing the original 
distribution into a convex combination of tree-structured 
distributions, we obtain an upper bound on the optimal 
value of the original problem (i.e., the log probability of 
the MAP assignment) in terms of the combined optimal 
values of the tree problems. We prove that this upper 
bound is tight if and only if all the tree distributions 
share an optimal configuration in common. An important 
implication is that any such shared configuration must 
also be a MAP configuration for the original distribution. 
Next we develop two approaches to attempting to obtain 
tight upper bounds: (a) a tree-relaxed linear program (LP), 
which is derived from the Lagrangian dual of the upper 
bounds; and (b) a tree-reweighted max-product message-passing
algorithm that is related to but distinct from
the max-product algorithm. In this way, we establish a 
connection between a certain LP relaxation of the mode- 
finding problem, and a reweighted form of the max-product 
(min-sum) message-passing algorithm. 

Keywords: Approximate inference; integer programming; 
iterative decoding; linear programming relaxation; Markov 
random fields; max-product algorithm; message-passing 
algorithms; min-sum algorithm; MAP estimation; marginal 
polytope. 



I. Introduction 

Integer programming problems arise in various fields, 
including communication theory, error-correcting cod- 
ing, image processing, statistical physics and machine 
learning [e.g., 35], [39], [8]. Many such problems can 
be formulated in terms of Markov random fields [e.g., 
8], [14], in which the cost function corresponds to a 
graph-structured probability distribution, and the goal is 
to find the maximum a posteriori (MAP) configuration. It 
is well known that the complexity of solving the MAP
estimation problem on a Markov random field (MRF) 
depends critically on the structure of the underlying 
graph. For cycle-free graphs (also known as trees), the 
MAP problem can be solved by a form of non-serial 
dynamic programming known as the max-product or 
min-sum algorithm [e.g., 2], [14], [15]. This algorithm, 
which entails passing "messages" from node to node, 
represents a generalization of the Viterbi algorithm [40] 
from chains to arbitrary cycle-free graphs. In recent 
years, the max-product algorithm has also been studied 
in application to graphs with cycles as a method for 
computing approximate MAP assignments [e.g., 1], [21], 
[22], [23], [29], [43]. Although the method may perform 
well in practice, it is no longer guaranteed to output 
the correct MAP assignment, and it is straightforward to 
demonstrate problems on which it specifies an incorrect 
(i.e., non-optimal) assignment. 

A. Overview 

In this paper, we present and analyze new methods 
for computing MAP configurations for MRFs defined 
on graphs with cycles. The basic idea is to use a convex 
combination of tree-structured distributions to derive 
upper bounds on the cost of a MAP configuration. We 
prove that any such bound is tight if and only if the trees 
share a common optimizing configuration; moreover, any 
such shared configuration must be MAP-optimal for the 
original problem. Consequently, when the bound is tight, 
obtaining a MAP configuration for a graphical model 
with cycles — in general, a very difficult problem — is 
reduced to the easy task of examining the optima of a 
collection of tree-structured distributions. 

Accordingly, we focus our attention on the problem of 
obtaining tight upper bounds, and propose two methods 
directed to this end. Our first approach is based on the 
convexity of the upper bounds, and the associated theory 
of Lagrangian duality. We begin by re-formulating the 
exact MAP estimation problem on a graph with cycles 
as a linear program (LP) over the so-called marginal 
polytope. We then consider the Lagrangian dual of the 
problem of optimizing our upper bound. In particular, 
we prove that this dual is another LP, one which has a 
natural interpretation as a relaxation of the LP for exact 



MAP estimation. The relaxation is obtained by replacing 
the marginal polytope for the graph with cycles, which 
is a very complicated set in general, by an outer bound 
with simpler structure. This outer bound is an exact 
characterization of the marginal polytope of any tree- 
structured distribution, for which reason we refer to this 
approach as a tree-based LP relaxation. 

The second method consists of a class of message- 
passing algorithms designed to find a collection of tree- 
structured distributions that share a common optimum. 
The resulting algorithm, though similar to the standard 
max-product (or min-sum) algorithm [e.g., 23], [43], dif- 
fers from it in a number of important ways. In particular, 
under the so-called optimum specification criterion, fixed 
points of our tree-reweighted max-product algorithm 
specify a MAP-optimal configuration with a guarantee 
of correctness. We also prove that under this condition, 
fixed points of the tree-reweighted max-product updates 
correspond to dual-optimal solutions of the tree-relaxed 
linear program. As a corollary, we establish that the 
ordinary max-product algorithm on trees is solving the 
dual of an exact LP formulation of the MAP estimation 
problem. 

Overall, this paper establishes connections between 
two approaches to solving the MAP estimation problem: 
LP relaxations of integer programming problems [e.g., 
7], [38], and (approximate) dynamic programming meth- 
ods using message-passing in the max-product alge- 
bra. More specifically, our work shows that a (suitably 
reweighted) form of the max-product or min-sum algo- 
rithm is very closely connected to a particular linear 
programming relaxation of the MAP integer program. 
This variational characterization has links to the recent 
work of Yedidia et al. [47], who showed that the sum- 
product algorithm has a variational interpretation involv- 
ing the so-called Bethe free energy. In addition, the 
work described here is linked in spirit to our previous 
work [41], [44], in which we showed how to upper 
bound the log partition function using a "convexified 
form" of the Bethe free energy. Whereas this convex 
variational problem led to a method for computing 
approximate marginal distributions, the current paper 
deals exclusively with the problem of computing MAP 
configurations. Importantly and in sharp contrast with 
our previous work, there is a non-trivial set of problems 
for which the upper bounds of this paper are tight, 
in which case the MAP-optimal configuration can be 
obtained by the techniques described here. 

B. Notes and related developments 

We briefly summarize some developments related to 
the ideas described in this paper. In a parallel collab- 
oration with Feldman and Karger [19], [18], [20], we 
have studied the tree-relaxed linear program (LP) and 



related message-passing algorithms as decoding methods 
for turbo-like and low-density parity check (LDPC) 
codes, and provided finite-length performance guaran- 
tees for particular codes and channels. In independent 
work, Koetter and Vontobel [30] used the notion of 
a graph cover to establish connections between the 
ordinary max-product algorithm for LDPC codes, and 
a particular polytope equivalent to the one defining our 
LP relaxation. In other independent work, Wiegerinck 
and Heskes [46] have proposed a "fractional" form of 
the sum-product algorithm that is closely related to 
the tree-reweighted sum-product algorithm considered 
in our previous work [44]; see also Minka [36] for a 
reweighted version of the expectation propagation algo- 
rithm. In other work, Kolmogorov [31], [32] has studied 
the tree-reweighted max-product message-passing algo- 
rithms presented here, and proposed a sequential form 
of tree-updates for which certain convergence guarantees 
can be established. In follow-up work, Kolmogorov and 
Wainwright [33] provided stronger optimality properties 
of tree-reweighted message-passing when applied to 
problems with binary variables and pairwise interactions. 

C. Outline 

The remainder of this paper is organized as follows. 
Section II provides necessary background on graph theory
and graphical models, as well as some preliminary
details on marginal polytopes, and a formulation of the
MAP estimation problem. In Section III, we introduce
the basic form of the upper bounds on the log probability
of the MAP assignment, and then develop necessary
and sufficient conditions for these bounds to be tight.
In Section IV, we first discuss how the MAP integer
programming problem has an equivalent formulation as
a linear program (LP) over the marginal polytope. We
then prove that the Lagrangian dual of the problem of
optimizing our upper bounds has a natural interpretation
as a tree-relaxation of the original LP. Section V is
devoted to the development of iterative message-passing 
algorithms and their relation to the dual of the LP 
relaxation. We conclude in Section VI with a discussion
and extensions to the analysis presented here. 

II. Preliminaries 

This section provides the background and some pre- 
liminary developments necessary for subsequent sec- 
tions. We begin with a brief overview of some graph- 
theoretic basics; we refer the reader to the books [9], 
[10] for additional background on graph theory. We then 
describe the formalism of Markov random fields; more 
details can be found in various sources [e.g., 12], [14], 
[34]. We conclude by formulating the MAP estimation 
problem for a Markov random field. 



A. Undirected graphs 

An undirected graph $G = (V, E)$ consists of a set of
nodes or vertices $V = \{1, \ldots, n\}$ that are joined by a
set of edges $E$. In this paper, we consider only simple
graphs, for which multiple edges between the same pair
of vertices are forbidden. For each $s \in V$, we let
$\Gamma(s) = \{ t \in V \mid (s,t) \in E \}$ denote the set of neighbors of $s$.
A clique of the graph $G$ is a fully-connected subset $C$
of the vertex set (i.e., $(s,t) \in E$ for all $s,t \in C$). The
clique $C$ is maximal if it is not properly contained within
any other clique. A cycle in a graph is a path from a node
$s_0$ back to itself; that is, a cycle consists of a sequence of
distinct edges $\{(s_0, s_1), (s_1, s_2), \ldots, (s_{k-1}, s_k)\}$ such
that $s_0 = s_k$.

A subgraph of $G = (V, E)$ is a graph $H = (V(H), E(H))$,
where $V(H)$ (respectively $E(H)$) are
subsets of $V$ (respectively $E$). Of particular importance
to our analysis are those (sub)graphs without cycles.
More precisely, a tree is a cycle-free subgraph
$T = (V(T), E(T))$; it is spanning if it reaches every vertex
(i.e., $V(T) = V$).

B. Markov random fields 

A Markov random field (MRF) is defined on the basis 
of an undirected graph $G = (V, E)$ in the following way.
For each $s \in V$, let $X_s$ be a random variable taking
values $x_s$ in some sample space $\mathcal{X}_s$. This paper deals
exclusively with the discrete case, for which $X_s$ takes
values in the finite alphabet $\mathcal{X}_s := \{0, \ldots, m_s - 1\}$.
By concatenating the variables at each node, we obtain
a random vector $X = \{X_s \mid s \in V\}$ with $n = |V|$
elements. Observe that $X$ itself takes values $x$ in
the Cartesian product space
$\mathcal{X}^n := \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n$.
For any subset $A \subseteq V$, we let $X_A$ denote the collection
$\{X_s \mid s \in A\}$ of random variables associated with nodes
in $A$, with a similar definition for $x_A$.

By the Hammersley-Clifford theorem [e.g., 34], any
Markov random field that is strictly positive (i.e.,
$p(x) > 0$ for all $x \in \mathcal{X}^n$) can be defined either in terms of
certain Markov properties with respect to the graph, or in
terms of a decomposition of the distribution over cliques
of the graph. We use the latter characterization here. For
the sake of development in the sequel, it is convenient
to describe this decomposition in exponential form [e.g.,
3]. We begin with some necessary notation. A potential
function associated with a given clique $C$ is a mapping
$\phi : \mathcal{X}^n \to \mathbb{R}$ that depends only on the subcollection
$x_C := \{x_s \mid s \in C\}$. There may be a family of potential
functions $\{\phi_\alpha \mid \alpha \in \mathcal{I}(C)\}$ associated with any given
clique, where $\alpha$ is an index ranging over some set $\mathcal{I}(C)$.
Taking the union over all cliques defines the overall
index set $\mathcal{I} = \cup_C \mathcal{I}(C)$. The full collection of potential
functions $\{\phi_\alpha \mid \alpha \in \mathcal{I}\}$ defines a vector-valued mapping
$\phi : \mathcal{X}^n \to \mathbb{R}^d$, where $d = |\mathcal{I}|$ is the total number of
potential functions. Associated with $\phi$ is a real-valued
vector $\theta = \{\theta_\alpha \mid \alpha \in \mathcal{I}\}$, known as the exponential
parameter vector. For a fixed $x \in \mathcal{X}^n$, we use $\langle \theta, \phi(x) \rangle$
to denote the ordinary Euclidean product in $\mathbb{R}^d$ between
$\theta$ and $\phi(x)$.

With this set-up, the collection of strictly positive
Markov random fields associated with the graph $G$
and potential functions $\phi$ can be represented as the
exponential family $\{p(x;\theta) \mid \theta \in \mathbb{R}^d\}$, where

$$p(x;\theta) \propto \exp\{\langle \theta, \phi(x) \rangle\} = \exp\Big\{ \sum_{\alpha \in \mathcal{I}} \theta_\alpha \phi_\alpha(x) \Big\}. \qquad (1)$$

Note that each vector $\theta \in \mathbb{R}^d$ indexes a particular
Markov random field $p(x;\theta)$ in this exponential family.

Example 1: The Ising model of statistical
physics [e.g., 5] provides a simple illustration of
a collection of MRFs in this form. This model involves
a vector $x \in \{-1,1\}^n$, with a distribution defined by
potential functions only on cliques of size at most two
(i.e., vertices and edges). As a result, the exponential
family in this case takes the form:

$$p(x;\theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}. \qquad (2)$$

Here $\theta_{st}$ is the weight on edge $(s,t)$, and $\theta_s$ is the
parameter for node $s$. In this case, the index set $\mathcal{I}$
consists of the union $V \cup E$. Note that the set of potentials
$\{x_s, s \in V\} \cup \{x_s x_t, (s,t) \in E\}$ is a basis for all
multinomials on $\{-1,1\}^n$ of maximum degree two that
respect the structure of $G$. $\diamondsuit$
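
As a small illustration of equation (2), the following sketch enumerates the Ising distribution on a three-node cycle by brute force. The graph, the parameter values, and the function name `ising_distribution` are assumptions made for this example only, and nodes are indexed from 0 rather than 1.

```python
import itertools
import math

def ising_distribution(n, edges, theta_node, theta_edge):
    """Enumerate p(x; theta) from equation (2) over x in {-1, +1}^n."""
    weights = {}
    for x in itertools.product([-1, 1], repeat=n):
        score = sum(theta_node[s] * x[s] for s in range(n))
        score += sum(theta_edge[(s, t)] * x[s] * x[t] for (s, t) in edges)
        weights[x] = math.exp(score)
    z = sum(weights.values())  # normalization constant
    return {x: w / z for x, w in weights.items()}

# Hypothetical 3-node cycle with zero node weights and equal edge weights.
edges = [(0, 1), (1, 2), (0, 2)]
theta_node = [0.0, 0.0, 0.0]
theta_edge = {e: 0.5 for e in edges}

p = ising_distribution(3, edges, theta_node, theta_edge)
p_max = max(p.values())
map_configs = sorted(x for x in p if math.isclose(p[x], p_max))
print(map_configs)
```

With positive edge weights and no node weights, the two configurations in which all spins agree share the largest probability, so the MAP configuration need not be unique.
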
When the collection of potential functions $\phi$ does not
satisfy any linear constraints, then the representation (1)
is said to be minimal [3], [4]. For example, the Ising
model (2) is minimal, because there is no linear combination
of the potentials $\phi = \{x_s, s \in V\} \cup \{x_s x_t, (s,t) \in E\}$
that is equal to a constant for all $x \in \{-1,1\}^n$. In
contrast, it is often convenient to consider an overcomplete
representation, in which the potential functions $\phi$
do satisfy linear constraints, and hence are no longer a
basis. More specifically, our development in the sequel
makes extensive use of an overcomplete representation
in which the basic building blocks are indicator functions
of the form $\delta_j(x_s)$, the function that is equal to one if
$x_s = j$, and zero otherwise. In particular, for a Markov
random field with interactions between at most pairs of
variables, we use the following collection of potential
functions:

$$\{\delta_j(x_s) \mid j \in \mathcal{X}_s\} \quad \text{for } s \in V, \qquad (3a)$$

$$\{\delta_j(x_s)\,\delta_k(x_t) \mid (j,k) \in \mathcal{X}_s \times \mathcal{X}_t\} \quad \text{for } (s,t) \in E, \qquad (3b)$$

which we refer to as the canonical overcomplete representation.
This representation involves a total of
$d := \sum_{s \in V} m_s + \sum_{(s,t) \in E} m_s m_t$ potential functions, indexed
by the set

$$\mathcal{I} := \big[\cup_{s \in V} \{(s;j),\ j \in \mathcal{X}_s\}\big] \cup \big[\cup_{(s,t) \in E} \{(st;jk),\ (j,k) \in \mathcal{X}_s \times \mathcal{X}_t\}\big]. \qquad (4)$$

The overcompleteness of the representation is manifest
in various linear constraints among the potentials; for
instance, the relation
$\delta_j(x_s) - \sum_{k \in \mathcal{X}_t} \delta_j(x_s)\,\delta_k(x_t) = 0$
holds for all $x_s \in \mathcal{X}_s$. As a consequence of this overcompleteness,
there are many exponential parameters corresponding
to a given distribution (i.e., $p(x;\theta) = p(x;\tilde{\theta})$
for $\theta \neq \tilde{\theta}$). Although this many-to-one correspondence
might seem undesirable, its usefulness is illustrated in
Section V.
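
The many-to-one correspondence can be checked numerically. The sketch below, which assumes a single edge $(s,t)$ with binary variables and arbitrary illustrative parameter values, moves the node weights of $s$ onto the edge using the relation $\delta_j(x_s) = \sum_k \delta_j(x_s)\delta_k(x_t)$ and confirms that every configuration keeps the same score, hence the same probability.

```python
import itertools
import math

M = 2  # hypothetical alphabet size for both variables on a single edge (s, t)

def phi(x):
    """Canonical overcomplete features (3a)-(3b): node s, node t, then the edge."""
    xs, xt = x
    node_s = [float(xs == j) for j in range(M)]
    node_t = [float(xt == k) for k in range(M)]
    edge = [float(xs == j and xt == k) for j in range(M) for k in range(M)]
    return node_s + node_t + edge

def inner(theta, x):
    return sum(a * b for a, b in zip(theta, phi(x)))

theta = [0.3, -0.2, 0.1, 0.4, 1.0, 0.0, 0.0, 1.0]  # d = 2*M + M*M = 8 entries

# Move the node-s weights onto the edge: subtract a_j from theta_{s;j}
# and add a_j to theta_{st;jk} for every k.
shifted = list(theta)
for j in range(M):
    a_j = theta[j]
    shifted[j] -= a_j
    for k in range(M):
        shifted[2 * M + j * M + k] += a_j

configs = list(itertools.product(range(M), repeat=2))
same_scores = all(
    math.isclose(inner(theta, x), inner(shifted, x), abs_tol=1e-12)
    for x in configs
)
print(theta != shifted, same_scores)
```
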

The bulk of this paper focuses exclusively on MRFs
with interactions between at most pairs $(x_s, x_t)$ of random
variables, which we refer to as pairwise MRFs.
In principle, there is no loss of generality in restricting
to pairwise interactions, since any factor graph over
discrete variables can be converted to this form by
introducing auxiliary random variables [23]; see the
Appendix for the details of this procedure. Moreover, the
techniques described in this paper can all be generalized
to apply directly to MRFs that involve higher-order
interactions, by dealing with hypertrees as opposed to
ordinary trees.$^1$ Finally, with the exception of specific
examples involving the Ising model, we exclusively use
the canonical overcomplete representation (3) defined in
terms of indicator functions.

C. Marginal distributions on graphs 

Our analysis in the sequel focuses on the local
marginal distributions that are defined by the indicator
functions in the canonical overcomplete representation (3).
In particular, taking expectations of these
indicators with respect to some distribution $p(\cdot)$ yields
marginal probabilities for each node $s \in V$

$$\mu_{s;j} = \mathbb{E}_p[\delta_j(x_s)] := \sum_{x \in \mathcal{X}^n} p(x)\,\delta_j(x_s) \qquad (5)$$

and for each edge $(s,t) \in E$

$$\mu_{st;jk} = \mathbb{E}_p[\delta_j(x_s)\,\delta_k(x_t)]. \qquad (6)$$

Note that equations (5) and (6) define a $d$-dimensional
vector $\mu = \{\mu_\alpha, \alpha \in \mathcal{I}\}$ of marginals, indexed by
elements of $\mathcal{I}$ defined in equation (4). We let $\mathrm{MARG}(G)$

$^1$ For brevity, we do not discuss hypertrees at length in this paper.
Roughly speaking, they amount to trees formed on clusters of nodes
from the original graph; see Wainwright et al. [42] for further details
on hypertrees.



denote the set of all such marginals realizable in this
way:

$$\mathrm{MARG}(G) := \{\mu \in \mathbb{R}^d \mid \mu_{s;j} = \mathbb{E}_p[\delta_j(x_s)], \text{ and } \mu_{st;jk} = \mathbb{E}_p[\delta_j(x_s)\,\delta_k(x_t)] \text{ for some } p(\cdot)\}. \qquad (8)$$

The conditions defining membership in $\mathrm{MARG}(G)$ can
be expressed more compactly in the equivalent vector
form $\mu = \mathbb{E}_p[\phi(x)] = \sum_{x \in \mathcal{X}^n} p(x)\,\phi(x)$, where $\phi$
denotes a vector consisting of the potential functions forming
the canonical overcomplete representation (3). We
refer to $\mathrm{MARG}(G)$ as the marginal polytope associated
with the graph $G$.

By definition, any marginal polytope is the convex hull
of a finite number of vectors, namely the collection
$\{\phi(x) \mid x \in \mathcal{X}^n\}$. Consequently, the Minkowski-Weyl
theorem [37] ensures that $\mathrm{MARG}(G)$ can be represented
as an intersection of half-spaces $\cap_{j \in \mathcal{J}} H_{a_j, b_j}$, where $\mathcal{J}$
is a finite index set and each half-space is of the form
$H_{a_j, b_j} := \{\mu \in \mathbb{R}^d \mid \langle a_j, \mu \rangle \le b_j\}$ for some
$a_j \in \mathbb{R}^d$, and $b_j \in \mathbb{R}$. These half-space constraints include
the non-negativity condition $\mu_\alpha \ge 0$ for each $\alpha \in \mathcal{I}$.
Moreover, due to the overcompleteness of the canonical
overcomplete representation, there are various equality$^2$
constraints that must hold; for instance, for all nodes
$s \in V$, we have the constraint $\sum_{j \in \mathcal{X}_s} \mu_{s;j} = 1$.

The number of additional (non-trivial) linear constraints
required to characterize $\mathrm{MARG}(G)$, though always
finite, grows rapidly in $n$ for a general graph with
cycles; see Deza and Laurent [16] for discussion of the
binary case. It is straightforward, however, to specify
a subset of constraints that any $\mu \in \mathrm{MARG}(G)$ must
satisfy. First, as mentioned previously, since the elements
of $\mu$ are marginal probabilities, we must have $\mu \ge 0$
(meaning that $\mu$ is in the positive orthant). Second,
as local marginals, the elements of $\mu$ must satisfy the
normalization constraints:

$$\sum_{j \in \mathcal{X}_s} \mu_{s;j} = 1 \quad \forall\, s \in V, \qquad (9a)$$

$$\sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st;jk} = 1 \quad \forall\, (s,t) \in E. \qquad (9b)$$

Third, since the single node marginal over $x_s$ must be
consistent with the joint marginal on $(x_s, x_t)$, the following
marginalization constraint must also be satisfied:

$$\sum_{k \in \mathcal{X}_t} \mu_{st;jk} = \mu_{s;j} \quad \forall\, (s,t) \in E,\ j \in \mathcal{X}_s. \qquad (10)$$

On the basis of these constraints,$^3$ we define the set
$\mathrm{LOCAL}(G)$ as all $\mu \in \mathbb{R}^d_+$ that satisfy constraints (9a),

$^2$ Any equality constraint $\langle a, \mu \rangle = b$ is equivalent to enforcing the
pair of inequality constraints $\langle a, \mu \rangle \le b$ and $\langle -a, \mu \rangle \le -b$.

$^3$ Note that the normalization constraint on $\{\mu_{st;jk}\}$ is redundant
given the marginalization constraint (10), and the normalization of



(9b), and (10). Here it should be understood that there
are two sets of marginalization constraints for each
edge $(s,t)$: one for each of the variables $x_s$ and $x_t$.
By construction, $\mathrm{LOCAL}(G)$ specifies an outer bound
on $\mathrm{MARG}(G)$; moreover, in contrast to $\mathrm{MARG}(G)$, it
involves only a number of inequalities that is polynomial
in $n$. More specifically, $\mathrm{LOCAL}(G)$ is defined by
$O(mn + m^2 |E|)$ inequalities, where $m := \max_s |\mathcal{X}_s|$.
Since the number of edges $|E|$ is at most $\binom{n}{2}$, this
complexity is at most $O(m^2 n^2)$. The constraint set
$\mathrm{LOCAL}(G)$ plays an important role in the sequel.
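
To see that the outer bound can be strict on a graph with cycles, the sketch below checks a candidate vector on a binary three-node cycle. The specific pseudomarginal (uniform node marginals with perfectly anti-correlated edges) and the helper name `in_local` are choices made for this illustration. The vector satisfies (9a), (9b) and (10), yet it assigns zero probability to agreement on every edge, whereas every configuration of a triangle has at least one edge whose endpoints agree; no convex combination of the vectors $\phi(x)$ can therefore reproduce it.

```python
import itertools

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]

# Candidate pseudomarginals: uniform nodes, perfectly anti-correlated edges.
mu_node = {s: [0.5, 0.5] for s in nodes}
mu_edge = {e: [[0.0, 0.5], [0.5, 0.0]] for e in edges}

def in_local(mu_node, mu_edge, tol=1e-12):
    """Check non-negativity, normalization (9a)-(9b) and marginalization (10)."""
    for row in mu_node.values():
        if min(row) < -tol or abs(sum(row) - 1.0) > tol:
            return False
    for (s, t), table in mu_edge.items():
        if min(min(r) for r in table) < -tol:
            return False
        if abs(sum(sum(r) for r in table) - 1.0) > tol:
            return False
        for j in range(len(table)):          # sum over x_t matches node s
            if abs(sum(table[j]) - mu_node[s][j]) > tol:
                return False
        for k in range(len(table[0])):       # sum over x_s matches node t
            if abs(sum(r[k] for r in table) - mu_node[t][k]) > tol:
                return False
    return True

# Every configuration of the triangle has at least this many agreeing edges,
# so any vector in MARG(G) has total agreement probability at least as large.
min_agreements = min(
    sum(1 for (s, t) in edges if x[s] == x[t])
    for x in itertools.product([0, 1], repeat=3)
)
pseudo_agreements = sum(mu_edge[e][0][0] + mu_edge[e][1][1] for e in edges)
print(in_local(mu_node, mu_edge), min_agreements, pseudo_agreements)
```
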

D. MAP estimation 

Of central interest in this paper is the computation of 
maximum a posteriori (MAP) configurations$^4$ for a given
distribution in an exponential form, i.e., configurations
in the set $\arg\max_{x \in \mathcal{X}^n} p(x; \bar{\theta})$, where $\bar{\theta} \in \mathbb{R}^d$ is a given
vector of weights. For reasons to be clarified, we refer
to $p(x; \bar{\theta})$ as the target distribution. The problem of
computing a MAP configuration arises in a wide variety 
of applications. For example, in image processing [e.g., 
8], computing MAP estimates can be used as the basis 
for image segmentation techniques. In error-correcting 
coding [e.g., 35], a decoder based on computing the 
MAP codeword minimizes the word error rate. 

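When using the canonical overcomplete representation
$\phi(x) = \{\delta_j(x_s),\ \delta_j(x_s)\,\delta_k(x_t)\}$, it is often convenient
to represent the exponential parameters in the following
functional form:

$$\theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j}\,\delta_j(x_s), \qquad (11a)$$

$$\theta_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \theta_{st;jk}\,\delta_j(x_s)\,\delta_k(x_t). \qquad (11b)$$

With this notation, the MAP problem is equivalent to
finding a configuration $x_{\mathrm{MAP}} \in \mathcal{X}^n$ that maximizes the
quantity

$$\langle \bar{\theta}, \phi(x) \rangle := \sum_{s \in V} \bar{\theta}_s(x_s) + \sum_{(s,t) \in E} \bar{\theta}_{st}(x_s, x_t). \qquad (12)$$

Although the parameter $\bar{\theta}$ is a known and fixed quantity,
it is useful for analytical purposes to view it as a
variable, and define a function $\Phi_\infty(\bar{\theta})$ as follows:

$$\Phi_\infty(\bar{\theta}) := \max_{x \in \mathcal{X}^n} \langle \bar{\theta}, \phi(x) \rangle. \qquad (13)$$

Note that $\Phi_\infty(\bar{\theta})$ represents the value of the optimal
(MAP) configuration as $\bar{\theta}$ ranges over $\mathbb{R}^d$. As the maximum
of a collection of linear functions, $\Phi_\infty$ is convex
in terms of $\bar{\theta}$.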
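
The definition (13) and the convexity claim can be exercised on a toy problem. The following sketch evaluates $\Phi_\infty$ by exhaustive enumeration, which is only feasible for very small models, and checks the convexity inequality on random pairs of parameters. The graph, the alphabet sizes, and the helper names `phi_inf`, `random_theta` and `mix` are assumptions of this example.

```python
import itertools
import random

def phi_inf(theta_node, theta_edge):
    """Brute-force Phi_inf in (13): maximize (12) over all configurations."""
    sizes = [len(row) for row in theta_node]
    best = float("-inf")
    for x in itertools.product(*(range(m) for m in sizes)):
        value = sum(theta_node[s][x[s]] for s in range(len(sizes)))
        value += sum(table[x[s]][x[t]] for (s, t), table in theta_edge.items())
        best = max(best, value)
    return best

def random_theta(rng, sizes, edges):
    node = [[rng.uniform(-1, 1) for _ in range(m)] for m in sizes]
    edge = {(s, t): [[rng.uniform(-1, 1) for _ in range(sizes[t])]
                     for _ in range(sizes[s])] for (s, t) in edges}
    return node, edge

def mix(lam, a, b):
    """Return lam * a + (1 - lam) * b, entry by entry."""
    node = [[lam * u + (1 - lam) * v for u, v in zip(ra, rb)]
            for ra, rb in zip(a[0], b[0])]
    edge = {e: [[lam * u + (1 - lam) * v for u, v in zip(ra, rb)]
                for ra, rb in zip(a[1][e], b[1][e])] for e in a[1]}
    return node, edge

rng = random.Random(0)
sizes, edges = [2, 3, 2], [(0, 1), (1, 2), (0, 2)]
gaps = []
for _ in range(200):
    a, b = random_theta(rng, sizes, edges), random_theta(rng, sizes, edges)
    lam = rng.random()
    lhs = phi_inf(*mix(lam, a, b))
    rhs = lam * phi_inf(*a) + (1 - lam) * phi_inf(*b)
    gaps.append(rhs - lhs)  # convexity predicts a non-negative gap
print(min(gaps))
```
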

$^4$ The term a posteriori arises from applications, in which case one
often wants to compute maximizing elements of the posterior distribution
$p(x \mid y; \bar{\theta})$, where $y$ is a fixed collection of noisy observations.
III. Upper bounds via convex combinations

This section introduces the basic form of the upper
bounds on $\Phi_\infty(\bar{\theta})$ to be considered in this paper. The
key property of $\Phi_\infty$ is its convexity, which allows us
to apply Jensen's inequality [28]. More specifically, let
$\{\rho^i\}$ be a finite collection of non-negative weights that
sum to one, and consider a collection $\{\theta^i\}$ of exponential
parameters such that $\sum_i \rho^i \theta^i = \bar{\theta}$. Then applying
Jensen's inequality yields the upper bound

$$\Phi_\infty(\bar{\theta}) \le \sum_i \rho^i\, \Phi_\infty(\theta^i). \qquad (14)$$

Note that the bound (14) holds for any collection of
exponential parameters $\{\theta^i\}$ that satisfy $\sum_i \rho^i \theta^i = \bar{\theta}$;
however, the bound will not necessarily be useful, unless
the evaluation of $\Phi_\infty(\theta^i)$ is easier than the original problem
of computing $\Phi_\infty(\bar{\theta})$. Accordingly, in this paper, we
focus on convex combinations of tree-structured exponential
parameters (i.e., the set of non-zero components
of $\theta^i$ is restricted to an acyclic subgraph of the full
graph), for which exact computations are tractable. In
this case, each index $i$ in equation (14) corresponds
to a spanning tree of the graph, and the corresponding
exponential parameter is required to respect the structure
of the tree. In the following, we introduce the necessary
notation required to make this idea precise.

A. Convex combinations of trees 

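For a given graph, let $T$ denote a particular spanning
tree, and let $\mathfrak{T} = \mathfrak{T}(G)$ denote the set of all spanning
trees. For a given spanning tree $T = (V, E(T))$, we
define a set

$$\mathcal{I}(T) = \{(s;j) \mid s \in V,\ j \in \mathcal{X}_s\} \cup \{(st;jk) \mid (s,t) \in E(T),\ (j,k) \in \mathcal{X}_s \times \mathcal{X}_t\},$$

corresponding to those indexes associated with all vertices
but only edges in the tree.

To each spanning tree $T \in \mathfrak{T}$, we associate an exponential
parameter $\theta(T)$ that must respect the structure of
$T$. More precisely, the parameter $\theta(T)$ must belong to
the linear constraint set $\mathcal{E}(T)$ given by

$$\{\theta(T) \in \mathbb{R}^d \mid \theta_\alpha(T) = 0 \ \ \forall\, \alpha \in \mathcal{I} \setminus \mathcal{I}(T)\}. \qquad (15)$$

By concatenating all of the tree vectors, we form a larger
vector $\boldsymbol{\theta} = \{\theta(T),\ T \in \mathfrak{T}\}$, which is an element of
$\mathbb{R}^{d \times |\mathfrak{T}(G)|}$. The vector $\boldsymbol{\theta}$ must belong to the constraint
set

$$\mathcal{E} := \{\boldsymbol{\theta} \in \mathbb{R}^{d \times |\mathfrak{T}(G)|} \mid \theta(T) \in \mathcal{E}(T) \ \ \forall\, T \in \mathfrak{T}(G)\}. \qquad (16)$$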

In order to define convex combinations of exponential 
parameters defined on spanning trees, we require a 
Fig. 1. Illustration of edge appearance probabilities.
The original graph is shown in panel (a). Probability 1/3 is
assigned to each of the three spanning trees
$\{T_i \mid i = 1, 2, 3\}$ shown in panels (b)-(d). Edge $b$ is a
so-called bridge in $G$, meaning that it must appear in
any spanning tree. Therefore, it has edge appearance
probability $\rho_b = 1$. Edges $e$ and $f$ appear in two and
one of the spanning trees, respectively, which gives
rise to edge appearance probabilities $\rho_e = 2/3$ and
$\rho_f = 1/3$.



probability distribution $\rho$ over the set of spanning trees

$$\rho := \Big\{ \rho(T) \;\Big|\; \rho(T) \ge 0,\ \sum_{T \in \mathfrak{T}} \rho(T) = 1 \Big\}.$$

For any distribution $\rho$, we define its support to be the set
of trees to which it assigns strictly positive probability;
that is

$$\mathrm{supp}(\rho) := \{T \in \mathfrak{T} \mid \rho(T) > 0\}. \qquad (17)$$

In the sequel, it will also be of interest to consider the
probability $\rho_e = \Pr_\rho\{e \in T\}$ that a given edge $e \in E$
appears in a spanning tree $T$ chosen randomly under $\rho$.
We let $\boldsymbol{\rho}_e = \{\rho_e \mid e \in E\}$ represent a vector of these
edge appearance probabilities. Any such vector $\boldsymbol{\rho}_e$ must
belong to the so-called spanning tree polytope [7], [17],
which we denote by $\mathbb{T}(G)$. See Figure 1 for an illustration
of the edge appearance probabilities. Although we
allow for the support $\mathrm{supp}(\rho)$ to be a strict subset of
the set of all spanning trees, we require that $\rho_e > 0$ for
all $e \in E$, so that each edge appears in at least one tree
with non-zero probability.
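
Edge appearance probabilities are easy to compute on small graphs. The sketch below enumerates the spanning trees of a 4-node cycle, a graph chosen here purely for illustration, and computes $\rho_e$ under the uniform distribution over trees. Exhaustive enumeration of edge subsets is only practical for tiny graphs; the function names are hypothetical.

```python
import itertools
from fractions import Fraction

def find(parent, u):
    """Follow parent pointers to the representative of u's component."""
    while parent[u] != u:
        u = parent[u]
    return u

def spanning_trees(n, edges):
    """All spanning trees on nodes 0..n-1: the acyclic subsets of n-1 edges."""
    trees = []
    for subset in itertools.combinations(edges, n - 1):
        parent = list(range(n))
        acyclic = True
        for s, t in subset:
            rs, rt = find(parent, s), find(parent, t)
            if rs == rt:          # this edge would close a cycle
                acyclic = False
                break
            parent[rs] = rt
        if acyclic:
            trees.append(subset)
    return trees

def edge_appearance(trees, rho, edges):
    """rho_e = total probability of the trees that contain edge e."""
    return {e: sum((r for T, r in zip(trees, rho) if e in T), Fraction(0))
            for e in edges}

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]       # 4-node cycle (assumed graph)
trees = spanning_trees(4, edges)
rho = [Fraction(1, len(trees))] * len(trees)   # uniform distribution over trees
rho_e = edge_appearance(trees, rho, edges)
print(len(trees), rho_e)
```

On the cycle, each spanning tree omits exactly one edge, so there are four trees and every edge appears in three of them.
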

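Given a collection of tree-structured parameters $\boldsymbol{\theta}$ and
a distribution $\rho$, we form a convex combination of tree
exponential parameters as follows

$$\mathbb{E}_\rho[\theta(T)] := \sum_T \rho(T)\,\theta(T). \qquad (18)$$

Let $\bar{\theta} \in \mathbb{R}^d$ be the target parameter vector for which
we are interested in computing $\Phi_\infty$, as well as a MAP
configuration of $p(x;\bar{\theta})$. For a given $\rho$, of interest are
collections $\boldsymbol{\theta}$ of tree-structured exponential parameters
such that $\mathbb{E}_\rho[\theta(T)] = \bar{\theta}$. Accordingly, we define the
following constraint set:

$$\mathcal{A}_\rho(\bar{\theta}) := \{\boldsymbol{\theta} \in \mathcal{E} \mid \mathbb{E}_\rho[\theta(T)] = \bar{\theta}\}. \qquad (19)$$

It can be seen that $\mathcal{A}_\rho(\bar{\theta})$ is never empty as long as
$\rho_e > 0$ for all edges $e \in E$. We say that any member
$\boldsymbol{\theta}$ of $\mathcal{A}_\rho(\bar{\theta})$ specifies a $\rho$-reparameterization of $p(x;\bar{\theta})$.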

Example 2 (Single cycle): To illustrate these definitions,
consider a binary vector $x \in \{0,1\}^4$ on a 4-node
cycle, with the distribution in the minimal Ising form

$$p(x;\bar{\theta}) \propto \exp\{x_1 x_2 + x_2 x_3 + x_3 x_4 + x_4 x_1\}.$$

In words, the target distribution is specified by the
minimal parameter $\bar{\theta} = [0\ 0\ 0\ 0\ 1\ 1\ 1\ 1]$, where
the zeros represent the fact that $\bar{\theta}_s = 0$ for all $s \in V$.
Suppose that $\rho$ is the uniform distribution $\rho(T_i) = 1/4$
for $i = 1, \ldots, 4$, so that $\rho_e = 3/4$ for each edge $e \in E$.
We construct a member $\boldsymbol{\theta}$ of $\mathcal{A}_\rho(\bar{\theta})$, as follows:
With this choice, it is easily verified that $\mathbb{E}_\rho[\theta(T)] = \bar{\theta}$
so that $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar{\theta})$.
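
One way to build a member of $\mathcal{A}_\rho(\bar{\theta})$ for the single cycle is sketched below. The rule used here, dividing each target edge weight by its edge appearance probability on the trees that contain the edge, is an assumption made for illustration and is not taken from the text; it is simply one choice that satisfies the constraint in (19). Only edge weights are tracked, since the node weights are all zero in this example.

```python
from fractions import Fraction

edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
theta_bar = {e: Fraction(1) for e in edges}   # target edge weights

# Each spanning tree of the cycle is obtained by removing one edge.
trees = [[e for e in edges if e != removed] for removed in edges]
rho = [Fraction(1, 4)] * 4
rho_e = {e: sum(r for T, r in zip(trees, rho) if e in T) for e in edges}

# Hypothetical choice: theta_e(T) = theta_bar_e / rho_e on tree edges, 0 elsewhere.
theta_T = [
    {e: (theta_bar[e] / rho_e[e] if e in T else Fraction(0)) for e in edges}
    for T in trees
]

# Convex combination (18): should reproduce the target parameter exactly.
mixture = {e: sum(r * th[e] for r, th in zip(rho, theta_T)) for e in edges}
print(rho_e, mixture == theta_bar)
```

The same rule works whenever $\rho_e > 0$ for every edge, which is consistent with the non-emptiness condition stated for $\mathcal{A}_\rho(\bar{\theta})$.
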



B. Tightness of upper bounds 

It follows from equations (14), (18) and (19) that for
any $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar{\theta})$, there holds:

$$\Phi_\infty(\bar{\theta}) \le \sum_T \rho(T)\,\Phi_\infty(\theta(T)) = \sum_T \rho(T) \max_{x \in \mathcal{X}^n} \{\langle \theta(T), \phi(x) \rangle\}. \qquad (20)$$

Our first goal is to understand when the upper bound (20)
is tight, that is, met with equality. It turns out that
equality holds if and only if the collection of trees share
a common optimum, which leads to the notion of tree
agreement.

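More formally, for any exponential parameter vector
$\theta \in \mathbb{R}^d$, define the collection $\mathrm{OPT}(\theta)$ of its optimal
configurations as follows:

$$\mathrm{OPT}(\theta) := \{x \in \mathcal{X}^n \mid \langle \theta, \phi(x') \rangle \le \langle \theta, \phi(x) \rangle \text{ for all } x' \in \mathcal{X}^n\}. \qquad (21)$$

Note that by the definition (13) of $\Phi_\infty$, there holds
$\langle \bar{\theta}, \phi(x) \rangle = \Phi_\infty(\bar{\theta})$ for any $x \in \mathrm{OPT}(\bar{\theta})$. With this
notation, we have: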

Proposition 1 (Tree agreement): Let $\boldsymbol{\theta} = \{\theta(T)\} \in
\mathcal{A}_\rho(\bar{\theta})$, and let $\cap_{T \in \mathrm{supp}(\rho)} \mathrm{OPT}(\theta(T))$ be the set of
configurations that are optimal for every tree-structured
distribution. Then the following containment always
holds:

$$\cap_{T \in \mathrm{supp}(\rho)} \mathrm{OPT}(\theta(T)) \subseteq \mathrm{OPT}(\bar{\theta}). \qquad (22)$$

Moreover, the bound (20) is tight if and only if the
intersection on the LHS is non-empty.

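Proof: The containment relation is clear from
the form of the upper bound (20). Let $x^*$ belong to
$\mathrm{OPT}(\bar{\theta})$. Then the difference of the RHS and the LHS
of equation (20) can be written as follows:

$$\begin{aligned}
\Big[\sum_T \rho(T)\,\Phi_\infty(\theta(T))\Big] - \Phi_\infty(\bar{\theta})
&= \Big[\sum_T \rho(T)\,\Phi_\infty(\theta(T))\Big] - \langle \bar{\theta}, \phi(x^*) \rangle \\
&= \sum_T \rho(T)\big[\Phi_\infty(\theta(T)) - \langle \theta(T), \phi(x^*) \rangle\big],
\end{aligned}$$

where the last equality uses the fact that
$\sum_T \rho(T)\,\theta(T) = \bar{\theta}$. Now for each $T \in \mathrm{supp}(\rho)$,
the term $\Phi_\infty(\theta(T)) - \langle \theta(T), \phi(x^*) \rangle$ is non-negative,
and equal to zero only when $x^*$ belongs
to $\mathrm{OPT}(\theta(T))$. Therefore, the bound is tight if and only
if $x^* \in \cap_{T \in \mathrm{supp}(\rho)} \mathrm{OPT}(\theta(T))$ for some $x^* \in \mathrm{OPT}(\bar{\theta})$.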
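
Tree agreement can be verified by brute force on the single-cycle example with $x \in \{0,1\}^4$. The sketch below assumes the uniform distribution over the four spanning trees and an illustrative reparameterization that places weight 4/3 on every tree edge; it works in the minimal Ising form with edge weights only, and nodes are indexed from 0.

```python
import itertools
from fractions import Fraction

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
trees = [[e for e in edges if e != removed] for removed in edges]
rho = [Fraction(1, 4)] * 4
theta_bar = {e: Fraction(1) for e in edges}
theta_T = [{e: Fraction(4, 3) for e in T} for T in trees]  # assumed choice

configs = list(itertools.product([0, 1], repeat=4))

def value(theta, x):
    """<theta, phi(x)> when only edge products x_s * x_t carry weight."""
    return sum(w * x[s] * x[t] for (s, t), w in theta.items())

def opt(theta):
    """Return (Phi_inf(theta), OPT(theta)) by exhaustive search."""
    best = max(value(theta, x) for x in configs)
    return best, {x for x in configs if value(theta, x) == best}

phi_bar, opt_bar = opt(theta_bar)
tree_results = [opt(th) for th in theta_T]
upper = sum(r * best for r, (best, _) in zip(rho, tree_results))  # RHS of (20)
shared = set.intersection(*[s for _, s in tree_results])          # LHS of (22)
print(phi_bar, upper, shared)
```

Here all four trees are optimized by the all-ones configuration, the bound (20) holds with equality, and the shared configuration is also optimal for the cycle, as Proposition 1 predicts.
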



The preceding result shows that the upper bound (20)
is tight if and only if all the trees in the support of
$\rho$ agree on a common configuration. When this tree
agreement condition holds, a MAP configuration for
the original problem $p(x;\bar{\theta})$ can be obtained simply by
examining the intersection $\cap_{T \in \mathrm{supp}(\rho)} \mathrm{OPT}(\theta(T))$ of
configurations that are optimal on every tree for which
$\rho(T) > 0$. Accordingly, we focus our attention on the
problem of finding upper bounds (20) that are tight,
so that a MAP configuration can be obtained. Since
the target parameter $\bar{\theta}$ is fixed and assuming that we
fix the spanning tree distribution $\rho$, the problem on
which we focus is that of optimizing the upper bound
as a function of $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar{\theta})$. Proposition 1 suggests two
different strategies for attempting to find a tight upper
bound, which are the subjects of the next two sections:

Direct minimization and Lagrangian duality: The first
approach is a direct one, based on minimizing equation (20).
In particular, for a fixed distribution $\rho$ over
spanning trees, we consider the constrained optimization
problem of minimizing the RHS of equation (20) subject
to the constraint $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar{\theta})$. The problem structure
ensures that strong duality holds, so that it can be tackled
via its Lagrangian dual. In Section IV, we show that this
dual problem is a linear programming (LP) relaxation of
the original MAP estimation problem.

Message-passing approach: In Section V, we derive
and analyze message-passing algorithms, the goal of
which is to find, for a fixed distribution $\rho$, a collection
of exponential parameters $\boldsymbol{\theta}^* = \{\theta^*(T)\}$ such that $\boldsymbol{\theta}^*$
belongs to the constraint set $\mathcal{A}_\rho(\bar{\theta})$ of equation (19),
and the intersection $\cap_T \mathrm{OPT}(\theta^*(T))$ of configurations
optimal for all tree problems is non-empty. Under these
conditions, Proposition 1 guarantees that for all configurations
in the intersection, the bound is tight. In
Section V, we develop a class of message-passing algorithms
with these two goals in mind. We also prove that
when the bound is tight, fixed points of these algorithms
specify optimal solutions to the LP relaxation derived in
Section IV.

IV. Lagrangian duality and tree relaxation 

In this section, we develop and analyze a Lagrangian
reformulation of the problem of optimizing the upper
bounds, i.e., minimizing the RHS of equation (20)
as a function of $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar{\theta})$. The cost function is a
non-negatively weighted linear combination of convex functions,
and so is also convex as a function of $\boldsymbol{\theta}$; moreover, the
constraints are linear in $\boldsymbol{\theta}$. Therefore, the minimization problem can be
solved via its Lagrangian dual. Before deriving this dual,
it is convenient to develop an alternative representation
of $\Phi_\infty$ as a linear program.

A. Linear program over the marginal polytope for exact 
MAP estimation 

Recall from equation (13) that the function value
$\Phi_\infty(\bar\theta)$ corresponds to the optimal value of the integer
program $\max_{x} \langle \bar\theta, \phi(x) \rangle$. We now reformulate this
integer program as a linear program (LP), which leads to an
alternative representation of the function $\Phi_\infty$, and hence
of the exact MAP estimation problem. In order to convert
from integer to linear program, our approach is the
standard one [e.g., 7], [38] of taking convex combinations
of all possible solutions. The resulting convex hull is
precisely the marginal polytope MARG(G) defined in
Section II-C. We summarize in the following:

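Lemma 1: The function $\Phi_\infty(\bar\theta)$ has an alternative
representation as a linear program over the marginal
polytope:

$$\Phi_\infty(\bar\theta) = \max_{\mu \in \mathrm{MARG}(G)} \langle \bar\theta, \mu \rangle, \tag{23}$$

where $\langle \bar\theta, \mu \rangle$ is shorthand for the sum
$\sum_{s \in V} \sum_{j} \mu_{s;j} \bar\theta_{s;j} + \sum_{(s,t) \in E} \sum_{j,k} \mu_{st;jk} \bar\theta_{st;jk}$.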



Proof: Although this type of LP reformulation
is standard in combinatorial optimization, we provide
a proof here for completeness. Consider the set
$\mathcal{P} := \{ p(\cdot) \mid p(x) \geq 0, \ \sum_{x} p(x) = 1 \}$ of all possible
probability distributions over $x$. We first claim that the
maximization over $x \in \mathcal{X}^n$ can be rewritten as an
equivalent maximization over $\mathcal{P}$ as follows:

$$\max_{x \in \mathcal{X}^n} \langle \bar\theta, \phi(x) \rangle = \max_{p \in \mathcal{P}} \Big\{ \sum_{x \in \mathcal{X}^n} p(x) \langle \bar\theta, \phi(x) \rangle \Big\}. \tag{24}$$

On one hand, the RHS is certainly greater than or equal
to the LHS, because for any configuration $x^*$, the set $\mathcal{P}$
includes the delta distribution that places all its mass
at $x^*$. On the other hand, for any $p \in \mathcal{P}$, the sum
$\sum_{x \in \mathcal{X}^n} p(x) \langle \bar\theta, \phi(x) \rangle$ is a convex combination of terms
of the form $\langle \bar\theta, \phi(x) \rangle$ for $x \in \mathcal{X}^n$, and so cannot be any
larger than $\max_{x} \langle \bar\theta, \phi(x) \rangle$.

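Making use of the functional notation in equation (11),
we now expand the summation on the RHS of equation (24),
and then use the linearity of expectation to
write:

$$\sum_{x \in \mathcal{X}^n} p(x) \Big\{ \sum_{s \in V} \bar\theta_s(x_s) + \sum_{(s,t) \in E} \bar\theta_{st}(x_s, x_t) \Big\} = \sum_{s \in V} \sum_{j \in \mathcal{X}_s} \bar\theta_{s;j} \mu_{s;j} + \sum_{(s,t) \in E} \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \bar\theta_{st;jk} \mu_{st;jk}.$$

Here $\mu_{s;j} := \sum_{x \in \mathcal{X}^n} p(x) \delta_j(x_s)$ and
$\mu_{st;jk} := \sum_{x \in \mathcal{X}^n} p(x) \delta_{jk}(x_s, x_t)$. As $p$ ranges over $\mathcal{P}$, the
marginals $\mu$ range over MARG(G). Therefore,
we conclude that $\max_{x \in \mathcal{X}^n} \langle \bar\theta, \phi(x) \rangle$ is equal to
$\max_{\mu \in \mathrm{MARG}(G)} \langle \bar\theta, \mu \rangle$, as claimed. ■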
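The two halves of this argument can be checked numerically on a small model. The following is a minimal sketch, not part of the original development: the function names, the dictionary layout for edge parameters, and the use of randomly drawn distributions are all illustrative assumptions. It enumerates every configuration, records the best score $\langle \bar\theta, \phi(x) \rangle$, and confirms that the expected score under sampled distributions $p \in \mathcal{P}$ never exceeds it.

```python
import itertools
import random


def score(theta_node, theta_edge, x):
    """<theta, phi(x)> for a pairwise model: node terms plus edge terms."""
    total = sum(theta_node[s][x[s]] for s in range(len(x)))
    total += sum(table[x[s]][x[t]] for (s, t), table in theta_edge.items())
    return total


def check_lp_equivalence(theta_node, theta_edge, n_states, trials=200, seed=0):
    """Return (best integer value, smallest gap to any sampled distribution).

    A non-negative gap illustrates that averaging over p in P cannot beat
    the best single configuration; the delta distribution attains it.
    """
    rng = random.Random(seed)
    n = len(theta_node)
    configs = list(itertools.product(range(n_states), repeat=n))
    scores = [score(theta_node, theta_edge, x) for x in configs]
    best = max(scores)
    smallest_gap = float("inf")
    for _ in range(trials):
        weights = [rng.random() for _ in configs]
        z = sum(weights)
        expected = sum(w / z * sc for w, sc in zip(weights, scores))
        smallest_gap = min(smallest_gap, best - expected)
    return best, smallest_gap
```

Sampling only probes the inequality; the proof above is what establishes it for every $p$.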



Remarks: Lemma 1 identifies $\Phi_\infty(\bar\theta)$ as the support
function [28] of the set MARG(G). Consequently,
$\Phi_\infty(\bar\theta)$ can be interpreted as the negative intercept of the
supporting hyperplane to MARG(G) with normal vector
$\bar\theta \in \mathbb{R}^d$. This property underlies the dual relation that is
the focus of the following section.

B. Lagrangian dual 

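Let us now address the problem of finding the tightest
upper bound of the form in equation (20). More formally,
for a fixed distribution $\rho$ over spanning trees, we wish
to solve the constrained optimization problem:

$$\min_{\boldsymbol{\theta} \in \mathcal{E}} \ \sum_{T} \rho(T) \Phi_\infty(\theta(T)) \quad \text{s.t.} \quad \sum_{T} \rho(T) \theta(T) = \bar\theta. \tag{25}$$

As defined in equation (16), the constraint set $\mathcal{E}$ consists
of all vectors $\boldsymbol{\theta} = \{\theta(T)\}$ such that for each tree $T$, the
subvector $\theta(T)$ respects the structure of $T$, meaning that
$\theta_\alpha(T) = 0 \ \ \forall \alpha \in \mathcal{I} \setminus \mathcal{I}(T)$.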



Note that the cost function is a convex combination of
convex functions; moreover, with $\rho$ fixed, the constraints
are all linear in $\boldsymbol{\theta}$. Under these conditions, strong duality
holds [6], so that this constrained optimization problem
can be tackled via its Lagrangian dual. The dual
formulation turns out to have a surprisingly simple and
intuitive form. In particular, recall the set LOCAL(G)
defined by the orthant constraint $\tau \in \mathbb{R}^d_+$, and the additional
linear constraints (9a), (9b) and (10). The polytope
LOCAL(G) turns out to be the constraint set in the dual
reformulation of our problem:

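Theorem 1: The Lagrangian dual to problem (25) is
given by the LP relaxation based on LOCAL(G). Given
that strong duality holds, the optimal primal value

$$\min_{\boldsymbol{\theta} \in \mathcal{E}, \ \sum_{T} \rho(T) \theta(T) = \bar\theta} \ \sum_{T} \rho(T) \Phi_\infty(\theta(T)) \tag{26}$$

is equal to the optimal dual value

$$\max_{\tau \in \mathrm{LOCAL}(G)} \Big\{ \sum_{s \in V} \sum_{j} \tau_{s;j} \bar\theta_{s;j} + \sum_{(s,t) \in E} \sum_{(j,k)} \tau_{st;jk} \bar\theta_{st;jk} \Big\}. \tag{27}$$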

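Proof: Let $\tau$ be a vector of Lagrange multipliers
corresponding to the constraints $\mathbb{E}_\rho[\theta(T)] = \bar\theta$. We then
form the Lagrangian associated with problem (25):

$$\mathcal{L}_{\rho;\bar\theta}(\boldsymbol{\theta}, \tau) = \mathbb{E}_\rho\big[\Phi_\infty(\theta(T))\big] + \Big\langle \tau, \ \bar\theta - \sum_{T} \rho(T) \theta(T) \Big\rangle = \sum_{T} \rho(T) \big[ \Phi_\infty(\theta(T)) - \langle \theta(T), \tau \rangle \big] + \langle \tau, \bar\theta \rangle.$$

We now compute the dual function $\mathcal{Q}_{\rho;\bar\theta}(\tau) := \inf_{\boldsymbol{\theta} \in \mathcal{E}} \mathcal{L}_{\rho;\bar\theta}(\boldsymbol{\theta}, \tau)$;
this minimization problem can be
decomposed into separate problems on each tree as
follows:

$$\sum_{T} \rho(T) \inf_{\theta(T) \in \mathcal{E}(T)} \big[ \Phi_\infty(\theta(T)) - \langle \theta(T), \tau \rangle \big] + \langle \tau, \bar\theta \rangle. \tag{28}$$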

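The following lemma is key to computing this infimum:

Lemma 2: The function

$$f(\tau) := \sup_{\theta(T) \in \mathcal{E}(T)} \big\{ \langle \theta(T), \tau \rangle - \Phi_\infty(\theta(T)) \big\} \tag{29}$$

has the explicit form

$$f(\tau) = \begin{cases} 0 & \text{if } \tau \in \mathrm{LOCAL}(G;T) \\ +\infty & \text{otherwise,} \end{cases} \tag{30}$$

where

$$\mathrm{LOCAL}(G;T) := \Big\{ \tau \in \mathbb{R}^d_+ \ \Big| \ \sum_{j \in \mathcal{X}_s} \tau_{s;j} = 1 \ \ \forall s \in V, \quad \sum_{j \in \mathcal{X}_s} \tau_{st;jk} = \tau_{t;k} \ \ \forall (s,t) \in E(T) \Big\}.$$

Proof: See the appendix.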



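Using Lemma 2, the value of the infimum (28) will
be equal to $\langle \tau, \bar\theta \rangle$ if $\tau \in \mathrm{LOCAL}(G;T)$ for all
$T \in \mathrm{supp}(\rho)$, and $-\infty$ otherwise. Since every edge in the
graph belongs to at least one tree in $\mathrm{supp}(\rho)$, we have
$\cap_{T \in \mathrm{supp}(\rho)} \mathrm{LOCAL}(G;T) = \mathrm{LOCAL}(G)$, so that the
dual function takes the form:

$$\mathcal{Q}_{\rho;\bar\theta}(\tau) = \begin{cases} \langle \tau, \bar\theta \rangle & \text{if } \tau \in \mathrm{LOCAL}(G) \\ -\infty & \text{otherwise.} \end{cases}$$

Thus, the dual optimal value is $\max_{\tau \in \mathrm{LOCAL}(G)} \langle \tau, \bar\theta \rangle$;
by strong duality [6], this optimum is equal to the
optimal primal value (25). ■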

The equivalence guaranteed by the duality relation in
Theorem 1 is useful, because the dual problem has
much simpler structure than the primal problem. For a
general distribution $\rho$, the primal problem (25) entails
minimizing a sum of functions over all spanning trees
of the graph, which can be a very large collection. In
contrast, the dual program on the RHS of equation (27)
is simply a linear program (LP) over LOCAL(G), which
is a relatively simple polytope. (In particular, it can
be characterized by $\mathcal{O}(mn + m^2 |E|)$ constraints, where
$m := \max_s |\mathcal{X}_s|$.)

This dual LP (27) also has a very natural
interpretation. In particular, the set LOCAL(G) is an outer
bound on the marginal polytope MARG(G), since any
valid marginal vector must satisfy all of the constraints
defining LOCAL(G). Thus, the dual LP (27) is simply
a relaxation of the original LP (23), obtained by
replacing the original constraint set MARG(G) by the set
LOCAL(G) formed by local (node and edgewise)
constraints. Note that for a graph with cycles, LOCAL(G) is
a strict superset of MARG(G). (In particular, Example 3
to follow provides an explicit construction of an element
$\tau \in \mathrm{LOCAL}(G) \setminus \mathrm{MARG}(G)$.) For this reason, we call
any $\tau \in \mathrm{LOCAL}(G)$ a pseudomarginal vector.

An additional interesting connection is that this poly- 
tope LOCAL(G) is equivalent to the constraint set 
involved in the Bethe variational principle which, as 
shown by Yedidia et al. [47], underlies the sum-product 
algorithm. In addition, it is possible to recover this 
LP relaxation as the "zero-temperature" limit of an 
optimization problem based on a convexified Bethe ap- 
proximation, as discussed in our previous work [44]. For 
binary variables, the linear program (27) can be shown
to be equivalent to a relaxation that has been studied 
in previous work [e.g., 27], [11]. The derivation given 
here illuminates the critical role of graphical structure in 
controlling the tightness of such a relaxation. In partic- 
ular, an immediate consequence of our development is 
the following: 



Corollary 1: The relaxation (27) is exact for any
problem on a tree-structured graph.

Since the LP relaxation (27) is always exact for
MAP estimation with any tree-structured distribution,
we refer to it as a tree relaxation. For a graph with
cycles, in sharp contrast to the tree-structured case,
LOCAL(G) is a strict outer bound on MARG(G),
and the relaxation (27) will not always be tight. Figure 2
provides an idealized illustration of LOCAL(G), and its
relation to the exact marginal polytope MARG(G). It
can be seen that the vertices of MARG(G) are all of the
form $\mu_J$, corresponding to the marginal vector realized
by the delta distribution that puts all its mass on $J \in \mathcal{X}^n$.
In the canonical overcomplete representation $\phi$, each
element of any such $\mu_J$ is either zero or one. These
integral vertices, denoted by $\mu_{\mathrm{int}}$, are drawn with black
circles in Figure 2(a). It is straightforward to show that
each such $\mu_J$ is also a vertex of LOCAL(G). However,
for graphs with cycles, LOCAL(G) includes additional
fractional vertices that lie strictly outside of MARG(G),
and that are drawn in gray circles in Figure 2(a).

Since LOCAL(G) is also a polytope, the optimum
of the LP relaxation (27) will be attained at a vertex
(possibly more than one) of LOCAL(G). Consequently,
solving the LP relaxation using LOCAL(G) as an outer
bound on MARG(G) can have one of two possible
outcomes. The first possibility is that the optimum is attained
at some vertex of LOCAL(G) that is also a vertex of
MARG(G). The optimum may occur at a unique integral
vertex, as illustrated in panel (b), or at multiple integral
vertices (not illustrated here). In this case, both the dual
LP relaxation (27), and hence also the primal version in
equation (20), are tight, and we can recover an optimal
MAP configuration for the original problem, which is
consistent with Proposition 2. Alternatively, the optimum
is attained only outside the original marginal polytope
MARG(G) at a fractional vertex of LOCAL(G), as
illustrated in panel (c). In this case, the relaxation must
be loose, so that Proposition 2 asserts that it is impossible
to find a configuration that is optimal for all
tree-structured problems. Consequently, whether or not the
tree agreement condition of Proposition 2 can be satisfied
corresponds precisely to the distinction between integral
and fractional vertices in the LP relaxation (27).

Example 3 (Integral versus fractional vertices): In
order to demonstrate explicitly the distinction between
fractional and integral vertices, we now consider the
simplest possible example, namely a binary problem
$x \in \{0, 1\}^3$ defined on the 3-node cycle. Consider the
parameter vector $\bar\theta$ with components defined as follows:





Fig. 2. (a) The constraint set LOCAL(G) is an outer bound on the exact marginal polytope. Its vertex set includes
all the integral vertices of MARG(G), which are in one-to-one correspondence with optimal solutions of the integer
program. It also includes additional fractional vertices, which are not vertices of MARG(G). (b)-(c) Solving an LP
with cost vector $\bar\theta$ entails translating a hyperplane with normal $\bar\theta$ until it is tangent to the constraint set LOCAL(G).
In (b), the point of tangency occurs at a unique integral vertex. In (c), the tangency occurs at a fractional vertex of
LOCAL(G) that lies outside of MARG(G).
Suppose first that $\beta$ is positive, say $\beta = 1$. By
construction of $\bar\theta$, we have $\langle \bar\theta, \tau \rangle \leq 0$ for all
$\tau \in \mathrm{LOCAL}(G)$. This inequality is tight when $\tau$ is either the
vertex $\mu_0$ corresponding to the configuration [0 0 0], or
its counterpart $\mu_1$ corresponding to [1 1 1]. In fact, both
of these configurations are MAP-optimal for the original
problem, so that we conclude that the LP relaxation (27)
is tight (i.e., we can achieve tree agreement).

On the other hand, suppose that $\beta < 0$; for
conciseness, say $\beta = -1$. This choice of $\bar\theta$ encourages all pairs
of configurations $(x_s, x_t)$ to be distinct (i.e., $x_s \neq x_t$).
However, in going around the cycle, there must hold
$x_s = x_t$ for at least one pair. Therefore, the set of
optimal configurations consists of [1 0 1], and the other
five permutations thereof. (I.e., all configurations except
[1 1 1] and [0 0 0] are optimal.) The value of any such
optimizing configuration, i.e., $\max_{\mu \in \mathrm{MARG}(G)} \langle \bar\theta, \mu \rangle$,
will be $-2\beta > 0$, corresponding to the fact that two
of the three pairs are distinct.

However, with reference to the relaxed polytope
LOCAL(G), a larger value of $\langle \bar\theta, \tau \rangle$ can be attained.
We begin by observing that $\langle \bar\theta, \tau \rangle \leq -3\beta$ for all
$\tau \in \mathrm{LOCAL}(G)$. In fact, equality can be achieved in
this inequality by the following pseudomarginal:
Overall, we have shown that $\max_{\tau \in \mathrm{LOCAL}(G)} \langle \bar\theta, \tau \rangle =
-3\beta > -2\beta = \max_{\mu \in \mathrm{MARG}(G)} \langle \bar\theta, \mu \rangle$, which
establishes that the relaxation (27) is loose for this
particular problem. Moreover, the pseudomarginal vector
$\tau$ defined in equation (32a) corresponds to a fractional
vertex of LOCAL(G), so that we are in the geometric
setting of Figure 2(c). $\diamondsuit$
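The gap between $-2\beta$ and $-3\beta$ can be reproduced in a few lines. The sketch below rests on two assumptions that are labeled here because the defining equations are not reproduced in this section: the node parameters are taken to be zero with edge parameters $\bar\theta_{st} = [[0, -\beta], [-\beta, 0]]$ (the form implied by the values quoted in Examples 3 and 4), and the pseudomarginal is taken to be uniform on the nodes with all edge mass on disagreeing pairs. The function name is hypothetical.

```python
import itertools


def three_cycle_values(beta):
    """Return (exact MAP value, value at a fractional point, membership flag)."""
    edges = [(0, 1), (1, 2), (0, 2)]
    # Assumed edge parameters: reward -beta when the two endpoints disagree.
    theta_edge = [[0.0, -beta], [-beta, 0.0]]

    # Exact MAP value by enumerating all 8 configurations (node terms are 0).
    map_value = max(
        sum(theta_edge[x[s]][x[t]] for s, t in edges)
        for x in itertools.product((0, 1), repeat=3)
    )

    # Assumed pseudomarginal: uniform nodes, edge mass only on x_s != x_t.
    tau_node = [0.5, 0.5]
    tau_edge = [[0.0, 0.5], [0.5, 0.0]]

    # Check the local constraints: normalization and both marginalizations.
    in_local = abs(sum(tau_node) - 1.0) < 1e-12
    for k in (0, 1):
        col_sum = sum(tau_edge[j][k] for j in (0, 1))
        row_sum = sum(tau_edge[k][j] for j in (0, 1))
        in_local = in_local and abs(col_sum - tau_node[k]) < 1e-12
        in_local = in_local and abs(row_sum - tau_node[k]) < 1e-12

    # <theta, tau>: the same edge table is used on all three edges.
    relaxed_value = sum(
        tau_edge[j][k] * theta_edge[j][k]
        for _ in edges for j in (0, 1) for k in (0, 1)
    )
    return map_value, relaxed_value, in_local
```

With `beta = -1.0`, the first returned value is the integer optimum and the second is the value attained at the locally consistent fractional point, so their difference is the looseness of the relaxation under these assumptions.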

V. Tree-reweighted message-passing algorithms

The main result of the preceding section is that the
problem of finding tight upper bounds, as formulated in
equation (25), is equivalent to solving the relaxed linear
program (27) over the constraint set LOCAL(G). A key
property of this constraint set is that it is defined by a 
number of constraints that is at most quadratic in the 
number of nodes n. Solving an LP over LOCAL(G), 
then, is certainly feasible by various generic methods, 
including the simplex algorithm [e.g., 7]. It is also of 
interest to develop algorithms that exploit the graphical 
structure intrinsic to the problem. Accordingly, this sec- 
tion is devoted to the development of iterative algorithms 
with this property. An interesting property of the iterative 
methods developed here is that when applied to a tree- 
structured graph, they all reduce to the ordinary max- 
product algorithm [23], [43]. For graphs with cycles, 
in contrast, they remain closely related to but nonethe- 
less differ from the ordinary max-product algorithm 
in subtle but important ways. Ultimately, we establish 
a connection between particular fixed points of these 
iterative algorithms and optimal dual solutions of the 
LP relaxation (27). In this way, we show that (suitably
reweighted) forms of the max-product algorithm have a 
variational interpretation in terms of the LP relaxation. 



As a corollary, our results show that the ordinary max- 
product algorithm for trees (i.e., the Viterbi algorithm) 
can be viewed as an iterative method for solving a 
particular linear program. 

We begin with some background on the notion of 
max-marginals, and their utility in computing exact MAP 
estimates of tree-structured distributions [2], [14], [15], 
[43]. We then define an analogous notion of pseudo-max- 
marginals for graphs with cycles, which play a central 
role in the message-passing algorithms that we develop 
subsequently. 

A. Max-marginals for tree-distributions 

Although the notion of max-marginal can be defined
for any distribution, of particular interest in the current
context are the max-marginals associated with a
distribution $p(x; \theta(T))$ that is Markov with respect to some
tree $T = (V, E(T))$. For each $s \in V$ and $j \in \mathcal{X}_s$,
the associated single node max-marginal is defined by
a maximization over all other nodes in the graph

$$\nu_{s;j} := \kappa_s \max_{\{x \mid x_s = j\}} p(x; \theta(T)). \tag{33}$$

Here $\kappa_s > 0$ is some normalization constant, included
for convenience, that is independent of $j$ but can vary
from node to node. Consequently, the max-marginal $\nu_{s;j}$
is proportional to the probability of the most likely
configuration under the constraint $x_s = j$. Note that $\nu_{s;j}$
is obtained by maximizing over the random variables
at all nodes $t \neq s$, whence the terminology
"max-marginal". For each edge $(s, t)$, the joint pairwise
max-marginal is defined in an analogous manner:

$$\nu_{st;jk} := \kappa_{st} \max_{\{x \mid (x_s, x_t) = (j,k)\}} p(x; \theta(T)). \tag{34}$$

Once again, the quantity $\kappa_{st}$ is a positive normalization
constant that can vary from edge to edge but does not
depend on $(j, k)$.

It is convenient to represent all the values $\{\nu_{s;j}, \ j \in
\mathcal{X}_s\}$ associated with a given node, and the values
$\{\nu_{st;jk}, \ (j, k) \in \mathcal{X}_s \times \mathcal{X}_t\}$ associated with a given edge
in the functional form:

$$\nu_s(x_s) := \sum_{j \in \mathcal{X}_s} \nu_{s;j} \, \delta_j(x_s), \qquad \nu_{st}(x_s, x_t) := \sum_{(j,k)} \nu_{st;jk} \, \delta_j(x_s) \delta_k(x_t).$$

It is well-known [14] that any tree-structured
distribution $p(x; \theta(T))$ can be factorized in terms of its
max-marginals as follows:

$$p(x; \theta(T)) \propto \prod_{s \in V} \nu_s(x_s) \prod_{(s,t) \in E(T)} \frac{\nu_{st}(x_s, x_t)}{\nu_s(x_s) \, \nu_t(x_t)}. \tag{36}$$

This factorization, which is entirely analogous to the
more familiar one in terms of (sum)-marginals, is a
special case of the more general junction tree
decomposition [15], [14]. Moreover, it can be shown [15],
[43] that the ordinary max-product (min-sum) algorithm
computes this max-marginal factorization. The fact that
this factorization can be computed in a straightforward
manner for any tree is exploited in the algorithms that
we develop in the sequel.
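As an illustration of the factorization (36), the following sketch computes max-marginals of a small tree-structured model by brute-force enumeration and evaluates the ratio between the factorized product and the unnormalized probability at every configuration. The function names and the list/dictionary data layout are assumptions made for this example, and enumeration is only viable for tiny models; the max-product algorithm is what one would use in practice.

```python
import itertools
import math


def tree_max_marginals(theta_node, theta_edge, n_states):
    """Brute-force max-marginals (kappa = 1) of exp(<theta, phi(x)>)."""
    n = len(theta_node)
    configs = list(itertools.product(range(n_states), repeat=n))

    def weight(x):
        energy = sum(theta_node[s][x[s]] for s in range(n))
        energy += sum(tab[x[s]][x[t]] for (s, t), tab in theta_edge.items())
        return math.exp(energy)

    w = {x: weight(x) for x in configs}
    nu_node = [
        [max(w[x] for x in configs if x[s] == j) for j in range(n_states)]
        for s in range(n)
    ]
    nu_edge = {
        (s, t): [
            [max(w[x] for x in configs if x[s] == j and x[t] == k)
             for k in range(n_states)]
            for j in range(n_states)
        ]
        for (s, t) in theta_edge
    }
    return w, nu_node, nu_edge


def factorization_ratios(w, nu_node, nu_edge):
    """Ratio of the right-hand side of (36) to w(x), for every x."""
    ratios = []
    for x, wx in w.items():
        f = 1.0
        for s, row in enumerate(nu_node):
            f *= row[x[s]]
        for (s, t), tab in nu_edge.items():
            f *= tab[x[s]][x[t]] / (nu_node[s][x[s]] * nu_node[t][x[t]])
        ratios.append(f / wx)
    return ratios
```

When the edge set forms a spanning tree, the ratios should agree across configurations, which is the proportionality asserted in (36); on a graph with cycles the same computation generally yields unequal ratios.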

The max-marginal factorization (36) yields a local
criterion for assessing the validity of tree max-marginals.
The following lemma provides a precise statement:

Lemma 3: A collection $\{\nu_s, \nu_{st}\}$ are valid
max-marginals for a tree if and only if the edgewise
consistency condition

$$\nu_s(x_s) = \kappa \max_{x'_t \in \mathcal{X}_t} \nu_{st}(x_s, x'_t) \tag{37}$$

holds$^5$ for every edge $(s, t) \in E(T)$.

Proof: Necessity of the edge consistency is clear.
The sufficiency can be established by an inductive
argument in which successive nodes are stripped from the
tree by local maximization; see [15], [43] for further
details. ■
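The edgewise condition (37) is simple to test mechanically. The sketch below is an illustrative helper, not something taken from the original text: the name `is_edge_consistent` and the data layout (a list of node tables and a dictionary of edge tables keyed by `(s, t)`) are assumptions. It checks that, for each edge, the row maxima are proportional to $\nu_s$ and the column maxima are proportional to $\nu_t$, with the proportionality constant playing the role of $\kappa$.

```python
import math


def is_edge_consistent(nu_node, nu_edge, rel_tol=1e-9):
    """Check condition (37) in both directions on every edge.

    nu_edge[(s, t)][j][k] holds nu_st(j, k); all entries must be positive.
    """
    for (s, t), tab in nu_edge.items():
        row_max = [max(row) for row in tab]        # max over x_t for each x_s
        col_max = [max(col) for col in zip(*tab)]  # max over x_s for each x_t
        for maxima, node in ((row_max, nu_node[s]), (col_max, nu_node[t])):
            ratios = [m / v for m, v in zip(maxima, node)]
            # kappa may differ per edge and direction, but not per state.
            if not all(math.isclose(r, ratios[0], rel_tol=rel_tol) for r in ratios):
                return False
    return True
```

By Lemma 3, passing this check on every edge of a tree is equivalent to the tables being valid max-marginals for that tree.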



The max-marginal representation (36) allows the
global problem of MAP estimation to be solved by
performing a set of local maximization operations.
In particular, suppose that the configuration $x^*$
belongs to $\mathrm{OPT}(\theta(T))$, meaning that it is MAP-optimal for
$p(x; \theta(T))$. For a tree, such configurations are
completely characterized by local optimality conditions with
respect to the max-marginals, as summarized in the
following:

Lemma 4 (Local optimality): Let $\{\nu_s, \nu_{st}\}$ be a valid
set of max-marginals for a tree-structured graph. Then a
configuration $x^*$ belongs to $\mathrm{OPT}(\theta(T))$ if and only if
the following local optimality conditions hold:

$$x^*_s \in \arg\max_{x_s} \nu_s(x_s) \quad \forall \, s, \tag{38a}$$

$$(x^*_s, x^*_t) \in \arg\max_{x_s, x_t} \nu_{st}(x_s, x_t) \quad \forall \, (s,t). \tag{38b}$$

Proof: The necessity of the conditions in
equation (38) is clear. To establish sufficiency, we follow
a dynamic-programming procedure. Any tree can be
rooted at a particular node $r \in V$, and all edges can be
directed from parent to child $(s \to t)$. To find a
configuration $x^* \in \mathrm{OPT}(\theta(T))$, begin by choosing an element
$x^*_r \in \arg\max_{x_r} \nu_r(x_r)$. Then proceed recursively down
the tree, from parent $s$ to child $t$, at each step choosing
the child configuration $x^*_t$ from $\arg\max_{x_t} \nu_{st}(x^*_s, x_t)$.

$^5$ Here $\kappa$ is a positive constant that depends on both the edge, and
the variable over which the maximization takes place.
By construction, the configuration $x^*$ so defined is
MAP-optimal; see [15], [43] for further details. ■
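The top-down procedure in this proof can be sketched as follows. The function name, the default root, and the table layout are illustrative assumptions; the input is assumed to be a valid set of max-marginals for a connected tree whose edges are the keys of `nu_edge`.

```python
def decode_from_max_marginals(nu_node, nu_edge, root=0):
    """Pick x_r at the root, then each child given its parent's choice."""
    n = len(nu_node)
    # For each node, list (neighbor, accessor) where accessor(j, k) returns
    # the pairwise max-marginal with this node in state j, neighbor in state k.
    neighbors = {s: [] for s in range(n)}
    for (s, t), tab in nu_edge.items():
        neighbors[s].append((t, lambda j, k, tab=tab: tab[j][k]))
        neighbors[t].append((s, lambda j, k, tab=tab: tab[k][j]))

    x = [None] * n
    x[root] = max(range(len(nu_node[root])), key=lambda j: nu_node[root][j])
    stack = [root]
    while stack:
        s = stack.pop()
        for t, pair in neighbors[s]:
            if x[t] is None:  # t is a child of s in the rooted tree
                x[t] = max(range(len(nu_node[t])), key=lambda k: pair(x[s], k))
                stack.append(t)
    return tuple(x)
```

Conditioning each child on its parent is what makes the procedure safe when max-marginals have ties: in a two-node model with pairwise table `[[2, 1], [1, 2]]`, both node max-marginals are flat, so maximizing each node independently could pair state 0 with state 1, whereas the top-down pass cannot.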

A particularly simple condition under which the local
optimality conditions (38) hold is when for each $s \in
V$, the max-marginal $\nu_s$ has a unique optimum $x^*_s$. In
this case, the MAP configuration $x^*$ is unique with
elements $x^*_s = \arg\max_{x_s} \nu_s(x_s)$ that are computed
easily by maximizing each single node max-marginal.
If this uniqueness condition does not hold, then more
than one configuration is MAP-optimal for $p(x; \theta)$. In
this case, maximizing each single node max-marginal is
no longer sufficient [43], and the dynamic-programming
procedure described in the proof of Lemma 4 must be
used.

B. Iterative algorithms 

We now turn to the development of iterative
algorithms for a graph $G = (V, E)$ that contains cycles. We
begin with a high-level overview of the concepts and
objectives, before proceeding to a precise description.

1) High level view: The notion of max-marginal is not
limited to distributions defined by tree-structured graphs,
but can also be defined for graphs with cycles. Indeed,
if we were able to compute the exact max-marginals of
$p(x; \bar\theta)$ and each single node max-marginal had a unique
optimum, then the MAP estimation problem could be
solved by local optimizations.$^6$ However, computing
exact max-marginals for a distribution on a general graph
with cycles is an intractable task. Therefore, it is again
necessary to relax our requirements.

The basic idea, then, is to consider a vector of
so-called pseudo-max-marginals $\nu := \{\nu_s, \nu_{st}\}$, the
properties of which are to be defined shortly. The qualifier
"pseudo" reflects the fact that these quantities no longer
have an interpretation as exact max-marginals, but
instead represent approximations to max-marginals on the
graph with cycles. For a given distribution $\rho$ over the
spanning trees of the graph $G$ and a tree $T$ for which
$\rho(T) > 0$, consider the subset $\nu(T)$, corresponding to
those elements of $\nu$ associated with $T$, i.e.,

$$\nu(T) := \{\nu_s, \ s \in V\} \cup \{\nu_{st}, \ (s,t) \in E(T)\}. \tag{39}$$

We think of $\nu(T)$ as implicitly specifying a
tree-structured exponential parameter via the
factorization (36), i.e.:

$$\nu(T) \longleftrightarrow \theta(T), \quad \forall \, T \in \mathrm{supp}(\rho), \tag{40}$$

which in turn implies that $\nu$ is associated with a
collection of tree-structured parameters, viz.:

$$\nu \longleftrightarrow \boldsymbol{\theta} := \{\theta(T) \mid T \in \mathrm{supp}(\rho)\}. \tag{41}$$



Now suppose that given $\rho$, we have a vector $\nu$ that
satisfies the following properties:

(a) The vector $\nu$ specifies a vector $\boldsymbol{\theta} \in \mathcal{A}_\rho(\bar\theta)$,
meaning that $\boldsymbol{\theta}$ is a $\rho$-reparameterization of the
original distribution.

(b) For all trees $T \in \mathrm{supp}(\rho)$, the vector $\nu(T)$ consists
of exact max-marginals for $p(x; \theta(T))$; we refer to
this condition as tree consistency.

Our goal is to iteratively adjust the elements of $\nu$
(and hence, implicitly, $\boldsymbol{\theta}$ as well) such that the
$\rho$-reparameterization condition always holds, and the tree
consistency condition is achieved upon convergence. In
particular, we provide algorithms such that any fixed
point $\nu^*$ satisfies both conditions (a) and (b).

The ultimate goal is to use $\nu^*$ to obtain a MAP
configuration for the target distribution $p(x; \bar\theta)$. The
following condition turns out to be critical in determining
whether or not $\nu^*$ is useful for this purpose:

Optimum specification: The pseudo-max-marginals
$\{\nu^*_s, \nu^*_{st}\}$ satisfy the optimum specification (OS)
criterion if there exists at least one configuration $x^*$ that
satisfies the local optimality conditions (38) for every
vertex $s \in V$ and edge $(s, t) \in E$ on the graph with
cycles.

Note that the OS criterion always holds for any set of
exact max-marginals on any graph. For the
pseudo-max-marginals updated by the message-passing algorithms,
in contrast, the OS criterion is no longer guaranteed to
hold, as we illustrate in Example 4 to follow.

In the sequel, we establish that when $\nu^*$ satisfies
the OS criterion with respect to some configuration
$x^*$, then any such $x^*$ must be MAP-optimal for the
target distribution. In contrast, when the OS criterion is
not satisfied, the pseudo-max-marginals $\{\nu^*_s, \nu^*_{st}\}$ do not
specify a MAP-optimal configuration, as can be seen by
a continuation of Example 3.

Example 4 (Failure of the OS criterion): Consider
the parameter vector $\bar\theta$ defined in equation (31a). Let
the spanning tree distribution $\rho$ place mass 1/3 on each
of the three spanning trees associated with a 3-node
single cycle. With this choice, the edge appearance
probabilities are $\rho_{st} = 2/3$ for each edge. We now
define a 2-vector $\log \nu^*_s$ of log pseudo-max-marginals
associated with node $s$, as well as a $2 \times 2$ matrix $\log \nu^*_{st}$
of log pseudo-max-marginals associated with edge
$(s, t)$, in the following way:

$$\log \nu^*_s := \begin{bmatrix} 0 & 0 \end{bmatrix} \quad \forall \, s \in V,$$

$$\log \nu^*_{st} := \frac{1}{(2/3)} \begin{bmatrix} 0 & -\beta \\ -\beta & 0 \end{bmatrix} \quad \forall \, (s,t) \in E.$$



$^6$ If a subset of the single node max-marginals had multiple optima,
the situation would be more complicated.



For each of the three trees in $\mathrm{supp}(\rho)$, the
associated vector $\nu^*(T)$ of pseudo-max-marginals defines a
tree-structured exponential parameter $\theta^*(T)$ as in
equation (40). More specifically, we have

$$\theta^*_s(T) = \log \nu^*_s \quad \forall \, s \in V,$$

$$\theta^*_{st}(T) = \begin{cases} \log \nu^*_{st} & \forall \, (s,t) \in E(T), \\ 0 & \text{otherwise.} \end{cases}$$

With this definition, it is straightforward to verify that
$\sum_{T} \frac{1}{3} \theta^*(T) = \bar\theta$, meaning that the $\rho$-reparameterization
condition holds. Moreover, for any $\beta \in \mathbb{R}$, the
edgewise consistency condition $\max_{x'_t} \nu^*_{st}(x_s, x'_t) =
\kappa \, \nu^*_s(x_s)$ holds. Therefore, the pseudo-max-marginals
are pairwise-consistent, so that by Lemma 3, they are
tree-consistent for all three spanning trees.

Now suppose that $\beta > 0$. In this case, the
pseudo-max-marginal vector $\nu^*$ does satisfy the OS criterion.
Indeed, both configurations [0 0 0] and [1 1 1]
achieve $\arg\max_{x_s} \nu^*_s(x_s)$ for all vertices $s \in V$, and
$\arg\max_{x_s, x_t} \nu^*_{st}(x_s, x_t)$ for all edges $(s,t) \in E$. This
finding is consistent with Example 3, where we
demonstrated that both configurations are MAP-optimal for the
original problem, and that the LP relaxation (27) is tight.

Conversely, suppose that $\beta < 0$. In this
case, the requirement that $x^*$ belong to the set
$\arg\max_{x_s, x_t} \nu^*_{st}(x_s, x_t)$ for all three edges means that
$x^*_s \neq x^*_t$ for all three pairs. Since this condition cannot
be met, the pseudo-max-marginal fails the OS criterion
for $\beta < 0$. Again, this is consistent with Example 3,
where we found that for $\beta < 0$, the optimum of the LP
relaxation (27) was attained only at a fractional vertex.
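The OS criterion for this example can be checked by enumeration. The sketch below is a hypothetical helper (the name `os_configurations` and its signature are assumptions) that uses the log pseudo-max-marginals of Example 4, namely a flat node table and the edge table $\frac{1}{\rho_{st}} [[0, -\beta], [-\beta, 0]]$, and returns every configuration that satisfies the local optimality conditions (38) at all three nodes and all three edges.

```python
import itertools


def os_configurations(beta, rho_st=2.0 / 3.0):
    """List configurations on the 3-cycle meeting (38a)-(38b) everywhere."""
    edges = [(0, 1), (1, 2), (0, 2)]
    log_nu_node = [0.0, 0.0]
    off_diag = -beta / rho_st
    log_nu_edge = [[0.0, off_diag], [off_diag, 0.0]]

    node_best = max(log_nu_node)
    edge_best = max(max(row) for row in log_nu_edge)

    satisfying = []
    for x in itertools.product((0, 1), repeat=3):
        nodes_ok = all(log_nu_node[x[s]] == node_best for s in range(3))
        edges_ok = all(log_nu_edge[x[s]][x[t]] == edge_best for s, t in edges)
        if nodes_ok and edges_ok:
            satisfying.append(x)
    return satisfying
```

For positive `beta` the returned list contains the two constant configurations; for negative `beta` it is empty, since no binary assignment can make all three pairs on an odd cycle disagree.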



2) Direct updating of pseudo-max-marginals: Our
first algorithm is based on updating a collection of
pseudo-max-marginals $\{\nu_s, \nu_{st}\}$ for a graph with
cycles such that $\rho$-reparameterization (condition (a)) holds
at every iteration, and tree consistency (condition (b))
is satisfied upon convergence. At each iteration $n =
0, 1, 2, \ldots$, associated with each node $s \in V$ is a
single node pseudo-max-marginal $\nu^n_s$, and with each
edge $(s, t) \in E$ is a joint pairwise pseudo-max-marginal
$\nu^n_{st}$. Suppose that for each tree $T$ in the support of $\rho$,
we use these pseudo-max-marginals $\{\nu^n_s, \nu^n_{st}\}$ to define
a tree-structured exponential parameter $\theta^n(T)$ via
equation (36). More precisely, again using the functional
notation as in equation (11), the tree-structured parameter
$\theta^n(T)$ is defined in terms of (element-wise) logarithms
of $\nu^n$ as follows:

$$\theta^n_s(T)(x_s) = \log \nu^n_s(x_s) \quad \forall \, s \in V, \tag{44a}$$

$$\theta^n_{st}(T)(x_s, x_t) = \begin{cases} \log \dfrac{\nu^n_{st}(x_s, x_t)}{\nu^n_s(x_s) \, \nu^n_t(x_t)} & \text{if } (s,t) \in E(T), \\ 0 & \text{otherwise.} \end{cases} \tag{44b}$$

The general idea is to update the
pseudo-max-marginals iteratively, in such a way that the
$\rho$-reparameterization condition is maintained, and the tree
consistency condition is satisfied upon convergence.
There are a number of ways in which such updates can
be structured; here we distinguish two broad classes of
strategies: tree-based updates, and parallel edge-based
updates. Tree-based updates entail performing multiple
iterations of updating on a fixed tree $T \in \mathrm{supp}(\rho)$,
updating only the subcollection $\nu(T)$ of
pseudo-max-marginals associated with vertices and edges in $T$ until
it is fully tree-consistent for this tree (i.e., so that the
components of $\nu(T)$ are indeed max-marginals for the
distribution $p(x; \theta(T))$). However, by focusing on this
one tree, we may be changing some of the $\nu_s$ and $\nu_{st}$ so
that we do not have tree-consistency on one or more of
the other trees $T' \in \mathrm{supp}(\rho)$. Thus, the next step entails
updating the pseudo-max-marginals $\nu(T')$ on one of the
other trees and so on, until ultimately the full collection
$\nu$ is consistent on every tree. In contrast, the edge-based
strategy involves updating the pseudo-max-marginal $\nu_{st}$
on each edge, as well as the associated single node
max-marginals $\nu_s$ and $\nu_t$, in parallel. This edgewise strategy
is motivated by Lemma 3, which guarantees that $\nu$ is
consistent on every tree of the graph if and only if the
edge consistency condition

$$\max_{x'_t \in \mathcal{X}_t} \nu_{st}(x_s, x'_t) = \kappa \, \nu_s(x_s) \tag{45}$$

holds for every edge $(s, t)$ of the graph with cycles.

It should be noted that tree-based updates are
computationally feasible only when the support of the spanning
tree distribution $\rho$ consists of a manageable number of
trees. When applicable, however, there can be important
practical benefits to tree-based updates, including more
rapid convergence as well as the possibility of
determining a MAP-optimal configuration prior to convergence.
We discuss tree-based updates and their properties
in more detail in Appendix C. We provide some
experimental results demonstrating the advantages of
tree-based updates in Section V-D.2.

Here we focus on edge-based updates, due to their
simplicity and close link to the ordinary max-product
algorithm that will be explored in the following section.
The edge-based reparameterization algorithm takes the
form shown in Figure 3; a few properties are worthy
of comment. First, each scalar $\rho_{st}$ appearing in
equations (46b) and (47a) is the edge appearance probability
of edge $(s, t)$ induced by the spanning tree distribution
$\rho$, as defined in Section III-A. Second, this edge-based
reparameterization algorithm is very closely related to
the ordinary max-product algorithm [43]. In fact, if
$\rho_{st} = 1$ for all edges $(s, t) \in E$, then the updates
are exactly equivalent to a (reparameterization form)
of the usual max-product updates. We will see this
equivalence explicitly in our subsequent presentation of
tree-reweighted message-passing updates.





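Algorithm 1: Edge-based reparameterization updates:

1. Initialize the pseudo-max-marginals $\{\nu^0_s, \nu^0_{st}\}$ in terms of the original exponential parameter vector as
follows:

$$\nu^0_s(x_s) = \kappa \exp\big(\bar\theta_s(x_s)\big) \tag{46a}$$

$$\nu^0_{st}(x_s, x_t) = \kappa \exp\Big(\frac{1}{\rho_{st}} \bar\theta_{st}(x_s, x_t) + \bar\theta_t(x_t) + \bar\theta_s(x_s)\Big) \tag{46b}$$

2. For iterations $n = 0, 1, 2, \ldots$, update the pseudo-max-marginals as follows:

$$\nu^{n+1}_s(x_s) = \kappa \, \nu^n_s(x_s) \prod_{t \in \Gamma(s)} \left[ \frac{\max_{x'_t} \nu^n_{st}(x_s, x'_t)}{\nu^n_s(x_s)} \right]^{\rho_{st}} \tag{47a}$$

$$\nu^{n+1}_{st}(x_s, x_t) = \kappa \, \frac{\nu^n_{st}(x_s, x_t)}{\max_{x'_t} \nu^n_{st}(x_s, x'_t) \ \max_{x'_s} \nu^n_{st}(x'_s, x_t)} \ \nu^{n+1}_s(x_s) \, \nu^{n+1}_t(x_t) \tag{47b}$$

Fig. 3: Edge-based reparameterization updates of the pseudo-max-marginals.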
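A direct transcription of Algorithm 1 into code helps make the updates concrete. The following is a minimal sketch under stated assumptions: tables are stored as nested Python lists, each undirected edge appears once as a key `(s, t)`, the free constants $\kappa$ are chosen so that every table has maximum entry 1, and all function names are hypothetical. The last function measures how far the quantity $\sum_{s} \log \nu_s + \sum_{(s,t)} \rho_{st} \log \frac{\nu_{st}}{\nu_s \nu_t}$ deviates from $\langle \bar\theta, \phi(x) \rangle$ plus a constant, which is one way to observe the $\rho$-reparameterization property numerically.

```python
import itertools
import math


def initialize(theta_node, theta_edge, rho):
    """Equations (46a)-(46b), with kappa set to 1."""
    nu_node = [[math.exp(v) for v in row] for row in theta_node]
    nu_edge = {
        (s, t): [[math.exp(tab[j][k] / rho[(s, t)] + theta_node[s][j] + theta_node[t][k])
                  for k in range(len(theta_node[t]))]
                 for j in range(len(theta_node[s]))]
        for (s, t), tab in theta_edge.items()
    }
    return nu_node, nu_edge


def update(nu_node, nu_edge, rho):
    """One parallel sweep of (47a)-(47b); kappa rescales each table to max 1."""
    row_max = {e: [max(row) for row in tab] for e, tab in nu_edge.items()}        # over x_t'
    col_max = {e: [max(col) for col in zip(*tab)] for e, tab in nu_edge.items()}  # over x_s'
    new_node = [list(row) for row in nu_node]
    for (s, t) in nu_edge:
        r = rho[(s, t)]
        for j, old in enumerate(nu_node[s]):
            new_node[s][j] *= (row_max[(s, t)][j] / old) ** r
        for k, old in enumerate(nu_node[t]):
            new_node[t][k] *= (col_max[(s, t)][k] / old) ** r
    new_node = [[v / max(row) for v in row] for row in new_node]
    new_edge = {}
    for (s, t), tab in nu_edge.items():
        raw = [[tab[j][k] / (row_max[(s, t)][j] * col_max[(s, t)][k])
                * new_node[s][j] * new_node[t][k]
                for k in range(len(tab[0]))] for j in range(len(tab))]
        top = max(max(row) for row in raw)
        new_edge[(s, t)] = [[v / top for v in row] for row in raw]
    return new_node, new_edge


def reparam_value(nu_node, nu_edge, rho, x):
    """sum_T rho(T) theta(T)(x), computed from the definition (44)."""
    val = sum(math.log(nu_node[s][x[s]]) for s in range(len(x)))
    for (s, t), tab in nu_edge.items():
        val += rho[(s, t)] * math.log(
            tab[x[s]][x[t]] / (nu_node[s][x[s]] * nu_node[t][x[t]]))
    return val


def invariant_spread(theta_node, theta_edge, rho, n_iters):
    """Spread over x of reparam_value(x) - <theta_bar, phi(x)> after n_iters sweeps."""
    nu_node, nu_edge = initialize(theta_node, theta_edge, rho)
    for _ in range(n_iters):
        nu_node, nu_edge = update(nu_node, nu_edge, rho)
    diffs = []
    for x in itertools.product(*(range(len(row)) for row in theta_node)):
        target = sum(theta_node[s][x[s]] for s in range(len(x)))
        target += sum(tab[x[s]][x[t]] for (s, t), tab in theta_edge.items())
        diffs.append(reparam_value(nu_node, nu_edge, rho, x) - target)
    return max(diffs) - min(diffs)
```

A spread near zero means the iterates differ from the original parameterization only by an additive constant, as the algorithm is designed to guarantee. The sketch says nothing about convergence, and rescaling by the maximum is just one convenient choice of $\kappa$.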



The following lemmas summarize the key properties
of Algorithm 1. We begin by claiming that all iterates
of this algorithm specify a $\rho$-reparameterization:

Lemma 5 ($\rho$-reparameterization): At each iteration
$n = 0, 1, 2, \ldots$, the collection of tree-structured
parameter vectors $\boldsymbol{\theta}^n = \{\theta^n(T)\}$, as specified by the
pseudo-max-marginals $\{\nu^n_s, \nu^n_{st}\}$ via equation (44), satisfies the
$\rho$-reparameterization condition.

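Proof: Using the initialization (46) and
equation (44), for each tree $T \in \mathrm{supp}(\rho)$, we have the
relation $\theta^0_{st}(T) = \bar\theta_{st} / \rho_{st}$ for all edges $(s,t) \in E(T)$,
and $\theta^0_s(T) = \bar\theta_s$ for all vertices $s \in V$. Thus
we have $\sum_{T} \rho(T) \theta^0_s(T) = \bar\theta_s$ for all $s \in V$ and
$\sum_{T} \rho(T) \theta^0_{st}(T) = \sum_{T \ni (s,t)} \rho(T) \frac{\bar\theta_{st}}{\rho_{st}} = \bar\theta_{st}$ for all
$(s, t) \in E$, so that $\rho$-reparameterization holds for $n = 0$.
We now proceed inductively: supposing that it holds for
iteration $n$, we prove that it also holds for iteration $n + 1$.
Using the update equation (47) and equation (44), we
find that for all $x \in \mathcal{X}^n$, the quantity $\theta^{n+1}(T)(x)$ is equal
to

$$\sum_{s \in V} \Big\{ \log \nu^n_s(x_s) + \sum_{t \in \Gamma(s)} \rho_{st} \log \frac{\max_{x'_t} \nu^n_{st}(x_s, x'_t)}{\nu^n_s(x_s)} \Big\} + \sum_{(s,t) \in E(T)} \log \frac{\nu^n_{st}(x_s, x_t)}{\max_{x'_t} \nu^n_{st}(x_s, x'_t) \ \max_{x'_s} \nu^n_{st}(x'_s, x_t)}.$$

Some algebraic re-arrangement leads to an equivalent
expression (up to additive constants independent of $x$)
for the weighted sum $\sum_{T} \rho(T) \theta^{n+1}(T)(x)$:

$$\sum_{s \in V} \log \nu^n_s(x_s) + \sum_{(s,t) \in E} \rho_{st} \log \frac{\nu^n_{st}(x_s, x_t)}{\nu^n_s(x_s) \, \nu^n_t(x_t)},$$

which, using equation (44), is seen to be equal to
$\sum_{T} \rho(T) \theta^n(T)(x)$. Thus, the statement follows by the
induction hypothesis. ■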



Next we characterize the fixed points of the updates in
step 2:

Lemma 6: Any fixed point $\nu^*$ of the updates
satisfies the tree consistency condition (b).

Proof: At a fixed point, we can substitute $\nu^* =
\nu^n = \nu^{n+1}$ at all places in the updates. Doing so in
equation (47b) and cancelling out common terms leads
to the relation

$$\kappa \, \frac{\nu^*_s(x_s) \, \nu^*_t(x_t)}{\max_{x'_t} \nu^*_{st}(x_s, x'_t) \ \max_{x'_s} \nu^*_{st}(x'_s, x_t)} = 1$$

for all $(x_s, x_t)$, from which the edgewise consistency
condition (45) follows for each edge $(s, t) \in E$. The
tree consistency condition then follows from Lemma 3. ■



3) Message-passing updates: The reparameterization updates of Algorithm 1 can also be described in terms of explicit message-passing operations. In this formulation, the pseudo-max-marginals depend on the original exponential parameter vector $\bar{\theta}$, as well as a set of auxiliary quantities $M_{st}(\cdot)$ associated with the edges of $G$. For each edge $(s,t) \in E$, $M_{st} : \mathcal{X}_t \to \mathbb{R}_+$ is a function from the state space $\mathcal{X}_t$ to the positive reals. The function $M_{st}(\cdot)$ represents information that is relayed from node $s$ to node $t$, so that we refer to it as a "message". The resulting algorithm is an alternative but equivalent implementation of the reparameterization updates of Algorithm 1.

More explicitly, let us define pseudo-max-marginals $\{\nu_s, \nu_{st}\}$ in terms of $\bar{\theta}$ and a given set of messages $M = \{M_{st}\}$ as follows:

$$\nu_s(x_s) \propto \exp(\bar{\theta}_s(x_s)) \prod_{v \in \Gamma(s)} [M_{vs}(x_s)]^{\rho_{vs}} \qquad (48a)$$

$$\nu_{st}(x_s, x_t) \propto \varphi_{st}(x_s, x_t)\, \frac{\prod_{v \in \Gamma(s) \setminus t} [M_{vs}(x_s)]^{\rho_{vs}}}{[M_{ts}(x_s)]^{(1-\rho_{ts})}}\, \frac{\prod_{v \in \Gamma(t) \setminus s} [M_{vt}(x_t)]^{\rho_{vt}}}{[M_{st}(x_t)]^{(1-\rho_{st})}} \qquad (48b)$$

where $\varphi_{st}(x_s, x_t) := \exp\big(\frac{1}{\rho_{st}}\bar{\theta}_{st}(x_s, x_t) + \bar{\theta}_s(x_s) + \bar{\theta}_t(x_t)\big)$. As before, these pseudo-max-marginals can be used to define a collection of tree-structured exponential parameters $\theta = \{\theta(T)\}$ via equation (44). First, we claim that for any choice of messages, the set of tree-structured parameters so defined specifies a $\rho$-reparameterization:

Lemma 7: For any choice of messages, the collection $\{\theta(T)\}$ is a $\rho$-reparameterization of $\bar{\theta}$.

Proof: We use the definition (44) of $\theta(T)$ in terms of $\{\nu_s, \nu_{st}\}$ to write $\sum_T \rho(T)\theta(T)(x)$ as

$$\sum_T \rho(T) \Big[ \sum_{s \in V} \log \nu_s(x_s) + \sum_{(s,t) \in E(T)} \log \frac{\nu_{st}(x_s, x_t)}{\nu_s(x_s)\,\nu_t(x_t)} \Big].$$

Expanding out the expectation yields

$$\sum_{s \in V} \log \nu_s(x_s) + \sum_{(s,t) \in E} \rho_{st} \log \frac{\nu_{st}(x_s, x_t)}{\nu_s(x_s)\,\nu_t(x_t)}. \qquad (49)$$

Using the definition (48) of $\nu_s$ and $\nu_{st}$, we have

$$\rho_{st} \log \frac{\nu_{st}(x_s, x_t)}{\nu_s(x_s)\,\nu_t(x_t)} = \bar{\theta}_{st}(x_s, x_t) - \rho_{st} \log M_{st}(x_t) - \rho_{st} \log M_{ts}(x_s).$$

As a consequence, each weighted log message $\rho_{st} \log M_{ts}(x_s)$ appears twice in equation (49): once in the term $\log \nu_s(x_s)$ with a plus sign, and once in the term $\log [\nu_{st}(x_s, x_t)/\nu_s(x_s)\nu_t(x_t)]$ with a negative sign. Therefore, the messages all cancel in the summation. This establishes that for all $x \in \mathcal{X}^n$, we have $\sum_T \rho(T)\theta(T)(x) = \sum_{s \in V} \bar{\theta}_s(x_s) + \sum_{(s,t) \in E} \bar{\theta}_{st}(x_s, x_t)$. ■
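The cancellation argument in this proof can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a triangle with binary variables and uniform edge weights $\rho_{st} = 2/3$, draws arbitrary positive messages, evaluates the unnormalized forms of (48a) and (48b), and compares expression (49) with the original cost $\sum_s \bar{\theta}_s + \sum_{(s,t)} \bar{\theta}_{st}$ for every configuration. The graph, the random values and the helper names are all hypothetical.

```python
import itertools
import math
import random

random.seed(1)
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
rho = {e: 2.0 / 3.0 for e in edges}   # assumed: uniform over the triangle's trees
K = 2                                  # binary states

theta_s = {s: [random.uniform(-1, 1) for _ in range(K)] for s in nodes}
theta_st = {e: [[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
            for e in edges}

# Arbitrary positive messages; M[(a, b)][x_b] is the message from a to b.
M = {}
for (s, t) in edges:
    M[(s, t)] = [random.uniform(0.1, 2.0) for _ in range(K)]
    M[(t, s)] = [random.uniform(0.1, 2.0) for _ in range(K)]

nbrs = {s: [] for s in nodes}
for (s, t) in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)


def r(a, b):
    """Edge weight lookup that ignores orientation."""
    return rho[(a, b)] if (a, b) in rho else rho[(b, a)]


def log_nu_s(s, xs):
    # Equation (48a), unnormalised.
    return theta_s[s][xs] + sum(r(v, s) * math.log(M[(v, s)][xs]) for v in nbrs[s])


def log_nu_st(s, t, xs, xt):
    # Equation (48b), unnormalised, for an edge stored as (s, t).
    val = theta_st[(s, t)][xs][xt] / rho[(s, t)] + theta_s[s][xs] + theta_s[t][xt]
    val += sum(r(v, s) * math.log(M[(v, s)][xs]) for v in nbrs[s] if v != t)
    val -= (1 - rho[(s, t)]) * math.log(M[(t, s)][xs])
    val += sum(r(v, t) * math.log(M[(v, t)][xt]) for v in nbrs[t] if v != s)
    val -= (1 - rho[(s, t)]) * math.log(M[(s, t)][xt])
    return val


def reparam_value(x):
    # Expression (49).
    total = sum(log_nu_s(s, x[s]) for s in nodes)
    for (s, t) in edges:
        total += rho[(s, t)] * (log_nu_st(s, t, x[s], x[t])
                                - log_nu_s(s, x[s]) - log_nu_s(t, x[t]))
    return total


def original_value(x):
    return (sum(theta_s[s][x[s]] for s in nodes)
            + sum(theta_st[(s, t)][x[s]][x[t]] for (s, t) in edges))


max_gap = max(abs(reparam_value(x) - original_value(x))
              for x in itertools.product(range(K), repeat=len(nodes)))
```

Because the unnormalized definitions are used, `max_gap` should be zero up to rounding; with normalized pseudo-max-marginals the two sides would differ by a constant independent of $x$.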

The message-passing updates shown in Figure 4 are designed to find a collection of pseudo-max-marginals $\{\nu_s, \nu_{st}\}$ that satisfy the tree consistency condition (b).

Algorithm 2: Parallel tree-reweighted max-product

1. Initialize the messages $M^0 = \{M^0_{st}\}$ with arbitrary positive real numbers.

2. For iterations $n = 0, 1, 2, \ldots$, update the messages as follows:

$$M^{n+1}_{ts}(x_s) = \kappa \max_{x'_t \in \mathcal{X}_t} \Big\{ \exp\Big( \frac{1}{\rho_{st}} \bar{\theta}_{st}(x_s, x'_t) + \bar{\theta}_t(x'_t) \Big)\, \frac{\prod_{v \in \Gamma(t) \setminus s} [M^n_{vt}(x'_t)]^{\rho_{vt}}}{[M^n_{st}(x'_t)]^{(1-\rho_{ts})}} \Big\} \qquad (50)$$

Fig. 4. Parallel edge-based form of tree-reweighted message-passing updates. The algorithm reduces to the ordinary max-product updates when all the edge weights $\rho_{st}$ are set equal to one.

First, it is worthwhile noting that the message update equation (50) is closely related to the standard max-product updates [23], [43], which correspond to taking $\rho_{st} = 1$ for every edge. On one hand, if the graph $G$ is actually a tree, any vector in the spanning tree polytope must necessarily satisfy $\rho_{st} = 1$ for every edge $(s,t) \in E$, so that Algorithm 2 reduces to the ordinary max-product update. However, if $G$ has cycles, then it is impossible to have $\rho_{st} = 1$ for every edge $(s,t) \in E$, so that the updates in equation (50) differ from the ordinary max-product updates in three critical ways. To begin, the exponential parameters $\bar{\theta}_{st}(x_s, x_t)$ are scaled by the (inverse of the) edge appearance probability $1/\rho_{st} \geq 1$. Second, for each neighbor $v \in \Gamma(t) \setminus s$, the incoming message $M_{vt}$ is exponentiated by the corresponding edge appearance probability $\rho_{vt} \leq 1$. Last of all, the update of message $M_{ts}$ (that is, from $t$ to $s$ along edge $(s,t)$) depends on the reverse direction message $M_{st}$ from $s$ to $t$ along the same edge. Despite these features, the messages can still be updated in an asynchronous manner, as in ordinary max-product [23], [43].

Moreover, we note that the tree-reweighted updates are related to but distinct from the attenuated max-product updates proposed by Frey and Koetter [24]. A feature common to both algorithms is the re-weighting of messages; however, unlike the tree-reweighted update (50), the attenuated max-product update in [24] of the message from $t$ to $s$ does not involve the message in the reverse direction (i.e., from $s$ to $t$).

By construction, any fixed point of Algorithm 2 specifies a set of tree-consistent pseudo-max-marginals, as summarized in the following:

Lemma 8: For any fixed point $M^*$ of the updates (50), the associated pseudo-max-marginals $\nu^*$ defined as in equations (48a) and (48b) satisfy the tree-consistency condition.

Proof: By Lemma 3, it suffices to verify that the edge consistency condition (45) holds for all edges $(s,t) \in E$. Using the definition (48) of $\nu^*_s$ and $\nu^*_{st}$, the edge consistency condition (45) is equivalent to equating $\exp(\bar{\theta}_s(x_s)) \prod_{v \in \Gamma(s)} [M_{vs}(x_s)]^{\rho_{vs}}$ with

$$\kappa \max_{x'_t \in \mathcal{X}_t} \Big\{ \varphi_{st}(x_s, x'_t)\, \frac{\prod_{v \in \Gamma(s) \setminus t} [M_{vs}(x_s)]^{\rho_{vs}}}{[M_{ts}(x_s)]^{(1-\rho_{ts})}}\, \frac{\prod_{v \in \Gamma(t) \setminus s} [M_{vt}(x'_t)]^{\rho_{vt}}}{[M_{st}(x'_t)]^{(1-\rho_{st})}} \Big\}.$$

Pulling out all terms involving $M_{ts}(x_s)$, and canceling out all remaining common terms yields the message update equation (50). ■

C. Existence and properties of fixed points

We now consider various questions associated with Algorithms 1 and 2, including existence of fixed points, convergence of the updates, and the relation of fixed points to the LP relaxation (27). As noted previously, the two algorithms (reparameterization and message-passing) represent alternative implementations of the same updates, and hence are equivalent in terms of their fixed point and convergence properties. For the purposes of the analysis given here, we focus on the message-passing updates given in Algorithm 2.

With reference to the first question, in related work [43], we proved the existence of fixed points for the ordinary max-product algorithm when applied to any distribution with strictly positive compatibilities defined on an arbitrary graph. The same proof can be adapted to show that the message-update equation (50) has at least one fixed point $M^*$ under these same conditions. Unfortunately, we do not yet have sufficient conditions to guarantee convergence on graphs with cycles; however, in practice, we find that the edge-based message-passing updates (50) converge if suitably damped. In particular, we apply damping in the logarithmic domain, so that messages are updated according to $\lambda \log M^{\mathrm{new}}_{ts} + (1 - \lambda) \log M^{\mathrm{old}}_{ts}$, where $M^{\mathrm{new}}_{ts}$ is calculated in equation (50). Moreover, we note that in follow-up work, Kolmogorov [31] has developed a modified form of tree-based updates for which certain convergence properties are guaranteed.
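To make the damped update concrete, the following is a minimal pure-Python sketch of equation (50) carried out in the logarithmic domain, with the damping rule $\lambda \log M^{\mathrm{new}}_{ts} + (1-\lambda)\log M^{\mathrm{old}}_{ts}$ described above. It is an illustration under assumptions rather than the implementation used for the experiments: the data layout, the function names `trw_messages` and `brute_force_map`, the fixed iteration count and the normalization by the maximum entry are all choices made here. The sanity check at the end uses a chain, where $\rho_{st} = 1$ and the update coincides with ordinary max-product.

```python
import itertools
import math
import random


def trw_messages(nodes, edges, theta_s, theta_st, rho, K=2, lam=0.5, iters=100):
    """Damped log-domain form of update (50).

    logM[(t, s)][x_s] holds the log-message from t to s.  Edge parameters and
    weights are stored once per undirected edge under the key used in `edges`.
    """
    nbrs = {s: [] for s in nodes}
    for (s, t) in edges:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def w(a, b):  # edge appearance probability, orientation-free lookup
        return rho[(a, b)] if (a, b) in rho else rho[(b, a)]

    def pair(s, t, xs, xt):  # theta_st(x_s, x_t), orientation-free lookup
        if (s, t) in theta_st:
            return theta_st[(s, t)][xs][xt]
        return theta_st[(t, s)][xt][xs]

    logM = {}
    for (s, t) in edges:
        logM[(s, t)] = [0.0] * K
        logM[(t, s)] = [0.0] * K

    for _ in range(iters):
        new = {}
        for (t, s) in logM:  # update the message from t to s
            cand = []
            for xs in range(K):
                best = -math.inf
                for xt in range(K):
                    val = pair(s, t, xs, xt) / w(s, t) + theta_s[t][xt]
                    val += sum(w(v, t) * logM[(v, t)][xt] for v in nbrs[t] if v != s)
                    val -= (1.0 - w(s, t)) * logM[(s, t)][xt]  # reverse message
                    best = max(best, val)
                cand.append(best)
            shift = max(cand)  # plays the role of the constant kappa
            new[(t, s)] = [lam * (c - shift) + (1.0 - lam) * old
                           for c, old in zip(cand, logM[(t, s)])]
        logM = new

    # Node pseudo-max-marginals (48a) in the log domain.
    log_nu = {s: [theta_s[s][x] + sum(w(v, s) * logM[(v, s)][x] for v in nbrs[s])
                  for x in range(K)] for s in nodes}
    return logM, log_nu


def brute_force_map(nodes, edges, theta_s, theta_st, K=2):
    """Exhaustive MAP search, usable only for very small problems."""
    def value(x):
        return (sum(theta_s[s][x[s]] for s in nodes)
                + sum(theta_st[(s, t)][x[s]][x[t]] for (s, t) in edges))
    return max(itertools.product(range(K), repeat=len(nodes)), key=value)


# Sanity check on a chain, where rho_st = 1 and (50) is ordinary max-product.
random.seed(0)
nodes = list(range(5))
edges = [(i, i + 1) for i in range(4)]
theta_s = {s: [random.uniform(-1, 1) for _ in range(2)] for s in nodes}
theta_st = {e: [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
            for e in edges}
rho = {e: 1.0 for e in edges}

_, log_nu = trw_messages(nodes, edges, theta_s, theta_st, rho, lam=1.0, iters=10)
decoded = tuple(max(range(2), key=lambda x: log_nu[s][x]) for s in nodes)
```

On a graph with cycles one would pass edge appearance probabilities below one and a damping value such as `lam=0.5`; as noted in the text, convergence is then an empirical observation rather than a guarantee, so a practical version would also monitor the change in the messages between iterations.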

Finally, the following theorem addresses the nature of the fixed points, and in particular provides sufficient conditions for Algorithm 2 to yield exact MAP estimates for the target distribution $p(x; \bar{\theta})$, and thereby establishes a link to the dual of the LP relaxation of Theorem 1:

Theorem 2: Let $M^*$ be a fixed point of Algorithm 2, and suppose that the associated pseudo-max-marginals $\nu^*$ satisfy the optimum specification (OS) criterion. Then the following statements hold:

(a) Any configuration $x^*$ satisfying the local optimality conditions in the OS criterion is a MAP configuration for $p(x; \bar{\theta})$.

(b) Let $\omega^* = \log M^*$ be the logarithm of the fixed point $M^*$ taken element-wise. Then a linear combination of $\omega^*$ specifies an optimal solution to the dual of the LP relaxation (27) of Theorem 1.



Proof: (a) By Lemma 7, the pseudo-max-marginals $\nu^*$ specify a $\rho$-reparameterization of $p(x; \bar{\theta})$. Since the message vector $M^*$ defining $\nu^*$ is a fixed point of the update equation (50), Lemma 8 guarantees that the tree consistency condition holds. By the optimum specification (OS) criterion, we can find a configuration $x^*$ that is node and edgewise optimal for $\nu^*$. By Lemma 4, the configuration $x^*$ is optimal for every tree-structured distribution $p(x; \theta^*(T))$. Thus, by Proposition 1, the configuration $x^*$ is MAP-optimal for $p(x; \bar{\theta})$.
(b) Let $M^*$ be a fixed point of the update equation (50), such that the pseudo-max-marginals $\nu^*$ satisfy the OS criterion. The proof involves showing that a linear combination of the vector $\omega^* := \log M^*$, defined by the element-wise logarithm of the message fixed point, is an optimal solution to a particular Lagrangian dual reformulation of the LP relaxation (27). For this proof, it is convenient to represent any pseudomarginal $\tau$ more compactly in the functional form

$$\tau_s(x_s) := \sum_{j \in \mathcal{X}_s} \tau_{s;j}\, \delta_j(x_s), \qquad \tau_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \tau_{st;jk}\, \delta_j(x_s)\, \delta_k(x_t).$$

For each edge $(s,t)$ and element $x_s \in \mathcal{X}_s$, define the linear function $C_{ts}(x_s) := \tau_s(x_s) - \sum_{x'_t \in \mathcal{X}_t} \tau_{st}(x_s, x'_t)$, and let $\lambda_{ts}(x_s)$ be a Lagrange multiplier associated with the constraint $C_{ts}(x_s) = 0$. We then consider the Lagrangian

$$\mathcal{L}_{\bar{\theta};\rho}(\tau, \lambda) = \langle \tau, \bar{\theta} \rangle + \sum_{(s,t) \in E} \rho_{st} \Big[ \sum_{x_s \in \mathcal{X}_s} \lambda_{ts}(x_s)\, C_{ts}(x_s) + \sum_{x_t \in \mathcal{X}_t} \lambda_{st}(x_t)\, C_{st}(x_t) \Big]. \qquad (52)$$

Note that we have rescaled the Lagrange multipliers by the edge appearance probabilities $\rho_{st} > 0$ so as to make the connection to the messages in Algorithm 2 as explicit as possible. For any vector of Lagrange multipliers $\lambda$, the dual function is defined by the maximization $Q_{\bar{\theta};\rho}(\lambda) := \max_{\tau \in S} \mathcal{L}_{\bar{\theta};\rho}(\tau, \lambda)$, over the constraint set

$$S := \Big\{ \tau \;\Big|\; \tau \geq 0,\ \sum_{x_s} \tau_s(x_s) = 1,\ \sum_{x_s, x_t} \tau_{st}(x_s, x_t) = 1 \Big\}. \qquad (53)$$

Now our goal is to specify how to choose a particular Lagrange multiplier vector $\lambda^*$ in terms of the log messages $\omega^*$, or equivalently the pseudo-max-marginals $\nu^*_s$ and $\nu^*_{st}$ defined by the messages $M^* := \exp(\omega^*)$. To define the link between $\lambda^*$ and $\nu^*$, we let $r$ be an arbitrary node of the graph, and suppose that every tree $T \in \mathrm{supp}(\rho)$ is rooted at $r$, and the remaining edges are directed from parent to child. More formally, for each node $s \neq r$, let $\pi^T(s)$ denote its unique parent. We now define

$$\lambda^*_{ts}(x_s) := \omega^*_{ts}(x_s) - \frac{1}{\rho_{st}} \sum_{\{T \,\mid\, \pi^T(s) = t\}} \rho(T) \log \nu^*_s(x_s). \qquad (54)$$

With this definition, the Lagrangian evaluated at $\lambda^*$ takes a particular form:

Lemma 9: With $\lambda^*$ defined in equation (54), the Lagrangian $\mathcal{L}_{\bar{\theta};\rho}(\tau, \lambda^*)$ can be written as

$$\sum_T \rho(T)\, \mathcal{F}(\tau, \nu^*; T) + \sum_{s \in V} \kappa_s \sum_{x_s} \tau_s(x_s) + \sum_{(s,t) \in E} \kappa_{st} \sum_{(x_s, x_t)} \tau_{st}(x_s, x_t),$$

where $\kappa_s$ and $\kappa_{st}$ are constants, and $\mathcal{F}(\tau, \nu^*; T)$ is given by

$$\sum_{s \neq r} \sum_{(x_s, x_{\pi^T(s)})} \tau_{s \pi^T(s)}(x_s, x_{\pi^T(s)}) \log \frac{\nu^*_{s \pi^T(s)}(x_s, x_{\pi^T(s)})}{\nu^*_{\pi^T(s)}(x_{\pi^T(s)})} + \sum_{x_r} \tau_r(x_r) \log \nu^*_r(x_r). \qquad (55)$$

Proof: See Appendix D. ■

We now determine the form of the dual function $Q_{\bar{\theta};\rho}(\lambda^*)$. Note that $\sum_{x_s} \tau_s(x_s) = 1$ and $\sum_{x_s, x_t} \tau_{st}(x_s, x_t) = 1$ on the constraint set $S$ defined in equation (53), so that the terms involving $\kappa_s$ and $\kappa_{st}$ play no role in the optimization. Using the definition of $Q$ and Lemma 9, we write

$$Q_{\bar{\theta};\rho}(\lambda^*) = \max_{\tau \in S} \sum_T \rho(T)\, \mathcal{F}(\tau, \nu^*; T) + \kappa \;\overset{(a)}{\leq}\; \sum_T \rho(T) \max_{\tau \in S} \mathcal{F}(\tau, \nu^*; T) + \kappa, \qquad (56)$$

where $\kappa := \sum_s \kappa_s + \sum_{(s,t)} \kappa_{st}$.

Now since $\nu^*$ satisfies the OS criterion by assumption, we can find a vector $x^*$ that achieves $\arg\max_{x_s} \nu^*_s(x_s)$ for every node, and $\arg\max_{x_s, x_t} \nu^*_{st}(x_s, x_t)$ for every edge. Consider the pseudomarginal vector given by

$$\tau^*_s(x_s) := \delta(x_s = x^*_s), \qquad (57a)$$
$$\tau^*_{st}(x_s, x_t) := \delta(x_s = x^*_s)\, \delta(x_t = x^*_t). \qquad (57b)$$

The following lemma is proved in Appendix E.

Lemma 10: For each tree $T$, we have $\max_{\tau \in S} \mathcal{F}(\tau, \nu^*; T) = \mathcal{F}(\tau^*, \nu^*; T)$.

This lemma shows that each of the maxima on the RHS of equation (56) is achieved at the same $\tau^*$, so that the inequality labeled (a) in equation (56) in fact holds with equality. Consequently, we have

$$Q_{\bar{\theta};\rho}(\lambda^*) = \sum_T \rho(T)\, \mathcal{F}(\tau^*, \nu^*; T) + \kappa = \mathcal{L}_{\bar{\theta};\rho}(\tau^*, \lambda^*). \qquad (58)$$

Since $\tau^*$ by construction satisfies all of the marginalization constraints (i.e., $\sum_{x_t} \tau^*_{st}(x_s, x_t) = \tau^*_s(x_s)$), the Lagrangian reduces to the cost function, so that we have shown that the dual value $Q_{\bar{\theta};\rho}(\lambda^*)$ is equal to

$$\sum_{s \in V} \sum_{x_s} \tau^*_s(x_s)\, \bar{\theta}_s(x_s) + \sum_{(s,t) \in E} \sum_{x_s, x_t} \tau^*_{st}(x_s, x_t)\, \bar{\theta}_{st}(x_s, x_t),$$

or equivalently to $\sum_{s \in V} \bar{\theta}_s(x^*_s) + \sum_{(s,t) \in E} \bar{\theta}_{st}(x^*_s, x^*_t)$, which is the optimal primal value. By strong duality, the pair $(\tau^*, \lambda^*)$ is primal-dual optimal.



For a general graph with cycles, the above proof does 
not establish that any fixed point of Algorithm 2 (i.e., 
one for which v* does not satisfy the OS criterion) 
necessarily specifies a dual-optimal solution of the LP 
relaxation. Indeed, in follow-up work, Kolmogorov [32] 
has constructed a particular fixed point, for which the OS 
criterion is not satisfied, that does not specify an optimal 
dual solution. However, for problems that involve only 
binary variables and pairwise interactions, Kolmogorov 
and Wainwright [33] have strengthened Theorem 2 to
show that a message-passing fixed point always specifies 
an optimal dual solution. 

However, on a tree-structured graph, the tree- 
reweighted max-product updates reduce to the ordinary 
max-product (min-sum) updates, and any fixed point v* 
must satisfy the OS criterion. In this case, we can use 
Theorem 2 to obtain the following corollary:

Corollary 2 (Ordinary max-product): For a tree-structured graph $T$, the ordinary max-product algorithm is an iterative method for solving the dual of the exact LP representation of the MAP problem:

$$\max_{x \in \mathcal{X}^n} \langle \bar{\theta}, \phi(x) \rangle = \max_{\tau \in \mathrm{LOCAL}(T)} \langle \bar{\theta}, \tau \rangle. \qquad (59)$$

Proof: By Lemma 1, the MAP problem $\max_{x \in \mathcal{X}^n} \langle \bar{\theta}, \phi(x) \rangle$ has the alternative LP representation as $\max_{\tau \in \mathrm{MARG}(T)} \langle \bar{\theta}, \tau \rangle$. By Corollary 1, the relaxation based on $\mathrm{LOCAL}(T)$ is exact for a tree, so that $\max_{\tau \in \mathrm{MARG}(T)} \langle \bar{\theta}, \tau \rangle = \max_{\tau \in \mathrm{LOCAL}(T)} \langle \bar{\theta}, \tau \rangle$, from which equation (59) follows. For the case of a tree, the only valid choice of $\rho_e$ is the vector of all ones, so the tree-reweighted updates must be equivalent to the ordinary max-product algorithm. The result then follows from Theorem 2. ■




D. Additional properties 

As proved in Corollaries 1 and 2, the techniques given
here are always exact for tree-structured graphs. For 
graphs with cycles, the general mode-finding problem 
considered here includes many NP-hard problems, so 
that our methods cannot be expected to work for all 
problems. In general, their performance — more specif- 
ically, whether or not a MAP configuration can be 
obtained — depends on both the graph structure, and the 
form of the parameter vector $\bar{\theta}$. In parallel and follow-up work to this paper, we have obtained more precise
performance guarantees for our techniques when applied 
to particular classes of problems (e.g., binary linear 
coding problems [20]; binary quadratic programs [33]). 
We discuss these results in more detail in the sequel. 

In this section, we begin with a comparison of 
reweighted max-product to the standard max-product. In 
particular, we explicitly construct a simple problem for 
which the ordinary max-product algorithm outputs an 
incorrect answer, but for which the reweighted updates 
provably find the global optimum. We then demonstrate 
the properties of edge-based versus tree-based updates, 
and discuss their relative merits. 

1) Comparison with ordinary max-product: Recall 
that for any graph with cycles, the tree-reweighted max- 
product algorithm (Algorithm 2) differs from the ordi- 
nary max-product algorithm in terms of the reweighting 
of messages and potential functions, and the involvement 
of the reverse direction messages. Here we illustrate with 
a simple example that these modifications are in general 
necessary for a message-passing algorithm to satisfy the 
exactness condition of Theorem 2(a). More precisely,
we construct a fixed point of the ordinary max-product 
algorithm that satisfies the optimum-specification (OS) 
criterion, yet the associated configuration x* is not MAP- 
optimal. 

In particular, consider the simple graph $G_{\mathrm{dia}}$ shown in Figure 5, and suppose that we wish to maximize a cost function of the form

$$\alpha (x_1 + x_4) + \beta (x_2 + x_3) + \gamma \sum_{(s,t) \in E} \big[ x_s (x_t - 1) + x_t (x_s - 1) \big]. \qquad (60)$$

Fig. 5: Simple diamond graph $G_{\mathrm{dia}}$.

Here the maximization is over all binary vectors $x \in \{0, 1\}^4$, and $\alpha$, $\beta$ and $\gamma$ are parameters to be specified. By design, the cost function (60) is such that if we make $\gamma$ sufficiently positive, then any optimal solution will either be $0^4 := [0\ 0\ 0\ 0]$ or $1^4 := [1\ 1\ 1\ 1]$. More concretely, suppose that we set $\alpha = 0.31$, $\beta = -0.30$, and $\gamma = 2.00$. With these parameter settings, it is straightforward to verify that the optimal solution is $1^4$. However, if we run the ordinary max-product algorithm on this problem, it converges to a set of singleton and edge-based pseudo-max-marginals $\nu^*$ of the form:



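$$\nu^*_s(x_s) = \begin{cases} \begin{bmatrix} 1 & 0.0250 \end{bmatrix} & \text{for } s \in \{1, 4\} \\ \begin{bmatrix} 1 & 0.0034 \end{bmatrix} & \text{if } s \in \{2, 3\}, \end{cases}$$

$$\nu^*_{st}(x_s, x_t) = \begin{cases} \begin{bmatrix} 1 & 0.0034 \\ 0.0034 & 0.0006 \end{bmatrix} & \text{for } (s,t) = (2,3) \\[2ex] \begin{bmatrix} 1 & 0.0250 \\ 0.0025 & 0.0034 \end{bmatrix} & \text{otherwise.} \end{cases}$$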



Note that these pseudo-max-marginals and the configuration $0^4$ satisfy the OS criterion (since the optimum is uniquely attained at $x_s = 0$ for every node, and at the pair $(x_s, x_t) = (0, 0)$ for every edge); however, the global configuration $0^4$ is not the MAP configuration for the original problem. This problem shows that the ordinary max-product algorithm does not satisfy an exactness guarantee of the form given in Theorem 2.
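The claim that $1^4$ is optimal for these parameter values can be confirmed by exhaustive enumeration of the sixteen binary configurations. The sketch below is only an illustration: the edge list is an assumption about the diamond graph (the text confirms only that an edge $(2,3)$ exists), and the function name `cost` is introduced here.

```python
import itertools

# Parameter values quoted in the text.
alpha, beta, gamma = 0.31, -0.30, 2.00

# Assumed edge set for the diamond graph on nodes 1..4; the figure is not
# reproduced here, so this list is a guess consistent with an edge (2, 3).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]


def cost(x):
    """Objective (60) for x = (x1, x2, x3, x4) with binary entries."""
    x1, x2, x3, x4 = x
    v = {1: x1, 2: x2, 3: x3, 4: x4}
    coupling = sum(v[s] * (v[t] - 1) + v[t] * (v[s] - 1) for (s, t) in edges)
    return alpha * (x1 + x4) + beta * (x2 + x3) + gamma * coupling


configs = list(itertools.product([0, 1], repeat=4))
best = max(configs, key=cost)
```

Each coupling term equals $-1$ when $x_s \neq x_t$ and $0$ otherwise, so with $\gamma = 2$ every non-constant configuration pays at least $2\gamma$ for the edges it cuts; the comparison therefore reduces to $0^4$ (value $0$) against $1^4$ (value $2\alpha + 2\beta = 0.02$).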

In fact, for the particular class of problems exemplified by equation (60), we can make a stronger assertion about the tree-reweighted max-product algorithm: namely, it will never fail on a problem of this general form, where the couplings $\gamma$ are non-negative. More specifically, in follow-up work to the current paper, Kolmogorov and Wainwright [33] have established theoretical guarantees on the performance of tree-reweighted message-passing for problems with binary variables and pairwise couplings. First, it can be shown that TRW message-passing always succeeds for any submodular binary problem, of which the example given in Figure 5 is a special case. Although it is known [26] that such problems can be solved in polynomial time via reduction to a max-flow problem, it is nonetheless interesting that tree-reweighted message-passing is also successful for this class of problems. An additional result [33] is that for any pairwise binary problem (regardless of the nature of the pairwise couplings), any variable $s$ that is uniquely specified by a TRW fixed point (i.e., for which the set $\arg\max_{x_s} \nu^*_s(x_s)$ is a singleton) is guaranteed to be correct in some globally optimal configuration. Thus, TRW fixed points can provide useful information about parts of the MAP optimal solution even when the OS criterion is not satisfied.

2) Comparison of edge-based and tree-based updates: In this section, we illustrate the empirical performance of the edge-based and tree-based updates on some sample problems. So as to allow comparison to the optimal answer even for large problems, we focus on binary problems. In this case, the theoretical guarantees [33] described above allow us to conclude that the tree-reweighted method yields correct information about (at least part of) the optimum, without any need to compute the exact MAP optimum by brute force. Thus, we can simply run the TRW algorithm (either the edge-based or tree-based updates) on any binary problem, and be guaranteed that a fixed point will either specify a globally optimal configuration (for attractive couplings), or that any uniquely specified variables (i.e., those for which $\arg\max_{x_s} \nu^*_s(x_s)$ is a singleton) will be correct (for arbitrary binary problems).

Our comparison is between the parallel edge-based form of reweighted message-passing (Algorithm 2), and the tree-based Algorithm 3 described in Appendix C. We focus on the amount of computation, as measured by the number of messages passed along each edge, required to either compute the fixed point (up to $\epsilon = 1 \times 10^{-8}$ accuracy), or, in the case of tree-based updates, to find a configuration $x^*$ on which all trees agree. In this latter case, Proposition 1 guarantees that the shared configuration $x^*$ must be globally MAP-optimal for the original problem, so that there is no need to perform any further message-passing.

We performed trials on problems in the Ising form (0, 
defined on grids with $n = 400$ nodes. For the edge-based updates, we used the uniform setting of edge appearance probabilities $\rho_{st} = \frac{n-1}{|E|}$; for the tree-based updates, we used two spanning trees, one with the horizontal rows plus a connecting column and the rotated version of this tree, placing weight $\rho(T^i) = \frac{1}{2}$ on each tree $i = 1, 2$. In each trial, the single node potentials were chosen randomly as $\theta_s \sim \mathcal{U}[-1, 1]$, whereas the edge couplings were chosen in one of the following two ways. In the attractive case, we chose the couplings as $\theta_{st} \sim \mathcal{U}[0, \gamma]$, where $\gamma > 0$ is the edge strength. In the mixed case, we chose $\theta_{st} \sim \mathcal{U}[-\gamma, \gamma]$. In both cases, we used damped forms of the updates (linearly combining messages or pseudo-max-marginals in the logarithmic domain) with damping parameter $\lambda = 0.50$.

[Figure 6: two panels, (a) Attractive couplings and (b) Mixed couplings; horizontal axis: coupling strength.]

Fig. 6. Comparison of parallel edge-based message-passing (Algorithm 2) and tree-based updates (Algorithm 3 described in Appendix C) on a nearest-neighbor grid with $n = 400$ variables. (a) Attractive couplings. (b) Mixed couplings.
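The experimental setup described above can be sketched in a few lines. The fragment below is a hypothetical reconstruction for illustration only: it assumes the grid is $20 \times 20$ (the text states only $n = 400$), and the names `rho_uniform` and `sample_ising` are introduced here. It builds the nearest-neighbor edge set, computes the uniform edge appearance probability, and draws node and edge parameters for the attractive and mixed regimes.

```python
import random

side = 20                      # assumed 20 x 20 grid, so n = 400 nodes
nodes = [(i, j) for i in range(side) for j in range(side)]

# Nearest-neighbour edges: one to the right and one below, where they exist.
edges = []
for (i, j) in nodes:
    if j + 1 < side:
        edges.append(((i, j), (i, j + 1)))
    if i + 1 < side:
        edges.append(((i, j), (i + 1, j)))

# Any spanning tree has n - 1 edges, so the edge appearance probabilities sum
# to n - 1; the uniform choice therefore assigns (n - 1) / |E| to every edge.
rho_uniform = (len(nodes) - 1) / len(edges)


def sample_ising(gamma, mixed, seed=0):
    """Draw node and edge parameters for the two coupling regimes in the text."""
    rng = random.Random(seed)
    theta_s = {s: rng.uniform(-1.0, 1.0) for s in nodes}
    lo = -gamma if mixed else 0.0
    theta_st = {e: rng.uniform(lo, gamma) for e in edges}
    return theta_s, theta_st


theta_s, theta_st = sample_ising(gamma=1.0, mixed=False)
```

For this grid there are $2 \cdot 20 \cdot 19 = 760$ edges, so the uniform weight works out to $399/760 \approx 0.525$.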

We investigated the algorithmic performance for a range of coupling strengths $\gamma$ for both attractive and mixed cases. For the attractive case, the TRW algorithm is theoretically guaranteed [33] to always find an optimal MAP configuration. For the mixed case, the average fraction of variables in the MAP optimal solution that the reweighted message-passing recovered was above 80% for all the examples that we considered; for mixed problems with weaker observation terms $\theta_s$, this fraction can be lower [see 33]. Figure 6 shows results comparing the behavior of the edge-based and tree-based updates. In each panel, plotted on the y-axis is the number of messages passed per edge (before achieving the stopping criterion) versus the coupling strength $\gamma$. Note that for more weakly coupled problems, the tree-based updates consistently find the MAP optimum with lower computation than the edge-based updates. As the coupling strength is increased, however, the performance of the tree-based updates slows down and ultimately becomes worse than the edge-based updates. In fact, for strong enough couplings, we observed on occasion that the tree-based updates could fail to converge, but instead oscillate (even with small damping parameters). These empirical observations are consistent with subsequent observations and results by Kolmogorov [31], who developed a modified form of tree-based updates for which certain convergence properties are guaranteed. (In particular, in contrast to the tree-based schedule given in Appendix C, they are guaranteed to generate a monotonically non-increasing sequence of upper bounds.)

3) Related work: In related work involving the meth- 
ods described here, we have found several applications 
in which the tree relaxation and iterative algorithms de- 
scribed here are useful. For instance, we have applied the 
tree-reweighted max-product algorithm to a distributed 
data association problem involving multiple targets and 
sensors [13]. For the class of problem considered, the 
tree-reweighted max-product algorithm converges, typ- 
ically quite rapidly, to a provably MAP-optimal data 
association. In other collaborative work, we have also
applied these methods to decoding turbo-like and low 
density parity check (LDPC) codes [19], [18], [20], 
and provided finite-length performance guarantees for 
particular codes and channels. In the context of decoding, 
the fractional vertices of the polytope LOCAL(G) have 
a very concrete interpretation as pseudocodewords [e.g., 
22], [25], [29], [45]. More broadly, it remains to further 
explore and analyze the range of problems for which the 
iterative algorithms and LP relaxations described here 
are suitable. 

VI. Discussion 

In this paper, we demonstrated the utility of con- 
vex combinations of tree-structured distributions in up- 
per bounding the value of the maximum a posteriori 
(MAP) configuration on a Markov random field (MRF) 
on a graph with cycles. A key property is that this 
upper bound is tight if and only if the collection of 
tree-structured distributions shares a common optimum. 
Moreover, when the upper bound is tight, then a MAP 
configuration can be obtained for the original MRF on 
the graph with cycles simply by examining the optima 
of the tree-structured distributions. This observation mo- 
tivated two approaches for attempting to obtain tight 
upper bounds, and hence MAP configurations. First of 
all, we proved that the Lagrangian dual of the problem 
is equivalent to a linear programming (LP) relaxation, 
wherein the marginal polytope associated with the origi- 
nal MRF is replaced with a looser constraint set formed 
by tree-based consistency conditions. Interestingly, this 
constraint set is equivalent to the constraint set in the 
Bethe variational formulation of the sum-product al- 
gorithm [47]; in fact, the LP relaxation itself can be 
obtained by taking a suitable limit of the "convexified" 
Bethe variational problem analyzed in our previous 



work [41], [44]. Second, we developed a family of tree-reweighted max-product algorithms that reparameterize
a collection of tree-structured distributions in terms of a 
common set of pseudo-max-marginals on the nodes and 
edges of the graph with cycles. When it is possible to 
find a configuration that is locally optimal with respect to 
every single node and edge pseudo-max-marginal, then 
the upper bound is tight, and the MAP configuration can 
be obtained. Under this condition, we proved that fixed 
points of these message-passing algorithms specify dual- 
optimal solutions to the LP relaxation. A corollary of 
this analysis is that the ordinary max-product algorithm, 
when applied to trees, is solving the Lagrangian dual of 
an exact LP formulation of the MAP estimation problem. 

Finally, in cases in which the methods described 
here do not yield MAP configurations, it is natural to 
consider strengthening the relaxation by forming clus- 
ters of random variables, as in the Kikuchi approxima- 
tions described by Yedidia et al. [47]. In the context 
of this paper, this avenue amounts to taking convex 
combinations of hypertrees, which (roughly speaking) 
correspond to trees defined on clusters of nodes. Such 
convex combinations of hypertrees lead, in the dual 
reformulation, to a hierarchy of progressively tighter LP 
relaxations, ordered in terms of the size of clusters used 
to form the hypertrees. On the message-passing side, it 
is also possible to develop hypertree-reweighted forms 
of generalizations of the max-product algorithm. 

Acknowledgments: We thank Jon Feldman and David 
Karger for helpful discussions. 

Appendix 

A. Conversion from factor graph to pairwise interactions

In this appendix, we briefly describe how any factor graph description of a distribution over a discrete (multinomial) random vector can be equivalently described in terms of a pairwise Markov random field [23], to which the pairwise LP relaxation based on $\mathrm{LOCAL}(G)$ specified by equations (9a), (9b) and (10) can be applied. To illustrate the general principle, it suffices to show how to convert a factor $f_{123}$ defined on the triplet $\{X_1, X_2, X_3\}$ of random variables into a pairwise form. Say that each $X_i$ takes values in some finite discrete space $\mathcal{X}_i$.

Given the factor graph description, we associate a new random variable $Z$ with the factor node $f$, which takes values in the Cartesian product space $\mathcal{Z} := \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3$. In this way, each possible value $z$ of $Z$ can be put in one-to-one correspondence with a triplet $(z_1, z_2, z_3)$, where $z_i \in \mathcal{X}_i$. For each $s \in \{1, 2, 3\}$, we define a pairwise compatibility function $\psi_{fs}$, corresponding to the interaction between $Z$ and $X_s$, by

$$\psi_{fs}(z, x_s) := \mathbb{I}[z_s = x_s],$$

where $\mathbb{I}[z_s = x_s]$ is a $\{0, 1\}$-valued indicator function for the event $\{z_s = x_s\}$. We set the singleton compatibility functions as

$$\psi_f(z) = f(z_1, z_2, z_3), \quad \text{and} \quad \psi_s(x_s) = 1.$$

With these definitions, it is straightforward to verify that the augmented distribution given by

$$\psi_f(z) \prod_{s=1}^{3} \psi_s(x_s) \prod_{s=1}^{3} \psi_{fs}(z, x_s) \qquad (61)$$

marginalizes down to $f(x_1, x_2, x_3)$. Thus, our augmented model with purely pairwise interactions faithfully captures the interaction among the triplet $\{x_1, x_2, x_3\}$.
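The marginalization claim can be verified directly on a small instance. The following sketch assumes binary variables and a randomly generated positive factor table (both hypothetical choices made for illustration); it evaluates the product (61) and sums out the auxiliary variable $z$.

```python
import itertools
import random

random.seed(0)
K = 2  # assumed: each X_s is binary

# A hypothetical positive factor f(x1, x2, x3), stored as a table.
f = {x: random.uniform(0.1, 1.0) for x in itertools.product(range(K), repeat=3)}

# The auxiliary variable Z ranges over all triplets (z1, z2, z3).
Z_values = list(itertools.product(range(K), repeat=3))


def psi_f(z):
    """Singleton compatibility on Z: the original factor evaluated at z."""
    return f[z]


def psi_fs(z, s, x_s):
    """Pairwise compatibility between Z and X_s: the indicator I[z_s = x_s]."""
    return 1.0 if z[s] == x_s else 0.0


def augmented(x, z):
    """Product (61); the singleton terms psi_s(x_s) are identically one."""
    val = psi_f(z)
    for s in range(3):
        val *= psi_fs(z, s, x[s])
    return val


# Summing out z should recover the original factor for every x.
marginal = {x: sum(augmented(x, z) for z in Z_values) for x in f}
```

Because the three indicators force $z = (x_1, x_2, x_3)$, exactly one term of the sum is non-zero, so the same identity holds if the sum over $z$ is replaced by a maximum, which is the operation relevant for MAP estimation.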

Finally, it is straightforward to verify that if we apply the pairwise LP relaxation based on $\mathrm{LOCAL}(G)$ to the augmented model (61), it generates an LP relaxation in terms of the $X_i$ variables that involves singleton pseudomarginal distributions $\tau_s$, and a pseudomarginal $\tau_{N(f)}$ over the variable neighborhood of each factor $f$. These pseudomarginals are required to be non-negative, normalized to one, and to satisfy the pairwise consistency conditions

$$\sum_{x_t,\ t \in N(f) \setminus \{s\}} \tau_{N(f)}(x_{N(f)}) = \tau_s(x_s) \qquad (62)$$

for all $s \in N(f)$, and for all factor nodes $f$. When the factor graph defines an LDPC code, this procedure generates the LP relaxation studied in Feldman et al. [20]. More generally, this LP relaxation can be applied to factor graph distributions other than those associated with LDPC codes.

B. Proof of Lemma 2

By definition, we have $\Phi_\infty(\theta(T)) := \max_{x \in \mathcal{X}^n} \langle \theta(T), \phi(x) \rangle$. We re-write this function in the following way:

$$\Phi_\infty(\theta(T)) \overset{(a)}{=} \max_{\tau \in \mathrm{LOCAL}(G)} \langle \theta(T), \tau \rangle \overset{(b)}{=} \max_{\tau \in \mathrm{LOCAL}(G;T)} \langle \theta(T), \tau \rangle,$$

where equality (a) follows from Lemma 1 and equality (b) follows because $\theta(T)_\alpha = 0$ for all $\alpha \notin \mathcal{I}(T)$. In this way, we recognize $\Phi_\infty(\theta(T))$ as the support function of the set $\mathrm{LOCAL}(G;T)$, from which it follows [28] that the conjugate dual is the indicator function of $\mathrm{LOCAL}(G;T)$, as specified in equation (29).

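For the sake of self-containment, we provide an explicit proof of this duality relation here. If $\tau$ belongs to $\mathrm{LOCAL}(G;T)$, then $\langle \theta(T), \tau \rangle - \Phi_\infty(\theta(T)) \leq 0$ holds for all $\theta(T) \in \mathcal{E}(T)$, with equality for $\theta(T) = 0$. From this relation, we conclude that

$$\sup_{\theta(T) \in \mathcal{E}(T)} \big\{ \langle \theta(T), \tau \rangle - \Phi_\infty(\theta(T)) \big\} = 0$$

whenever $\tau \in \mathrm{LOCAL}(G;T)$.

On the other hand, if $\tau \notin \mathrm{LOCAL}(G;T)$, then by the (strong) separating hyperplane theorem [28], there must exist some vector $\gamma$ and constant $\beta$ such that (i) $\langle \gamma, \mu \rangle \leq \beta$ for all $\mu \in \mathrm{LOCAL}(G;T)$; and (ii) $\langle \gamma, \tau \rangle > \beta$. Since conditions (i) and (ii) do not depend on elements $\gamma_\alpha$ with $\alpha \notin \mathcal{I}(T)$, we can take $\gamma = \gamma(T) \in \mathcal{E}(T)$ without loss of generality. We then have

$$\langle \gamma(T), \tau \rangle - \Phi_\infty(\gamma(T)) \geq \langle \gamma(T), \tau \rangle - \beta > 0. \qquad (63)$$

Note that conditions (i) and (ii) are preserved under scaling of both $\gamma(T)$ and $\beta$ by a positive number, so that we can send the quantity (63) to positive infinity. We thus conclude that

$$\sup_{\theta(T) \in \mathcal{E}(T)} \big\{ \langle \theta(T), \tau \rangle - \Phi_\infty(\theta(T)) \big\} = +\infty$$

whenever $\tau \notin \mathrm{LOCAL}(G;T)$. This completes the proof of the lemma.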



C. Tree-based updates

This appendix provides a detailed description of tree-based updates. In this scheme, each iteration involves multiple rounds of message-passing on each tree $T$ in the support of $\rho$. More specifically, the computational engine used within each iteration is the ordinary max-product algorithm, applied as an exact technique to compute max-marginals for each tree-structured distribution.

At any iteration $n$, we let $\theta^n(T)$ denote a set of exponential parameters for the tree $T$. To be clear, the notation $\theta^n(T)$ used in this appendix differs slightly from its use in the main text. In particular, unlike in the main text, we can have $\theta^n_\alpha(T) \neq \theta^n_\alpha(T')$ for distinct trees $T \neq T'$ at intermediate iterations, although upon convergence this equality will hold. Each step of the algorithm will involve computing, for every tree $T \in \mathrm{supp}(\rho)$, the max-marginals $\nu^n(T)$ associated with the tree-structured distribution $p(x; \theta^n(T))$. (Once again, unlike the main text, we need not have $\nu^n_\alpha(T) = \nu^n_\alpha(T')$ for distinct trees $T \neq T'$.) Overall, the tree-based updates take the form given in Figure 7.

Algorithm 3: Tree-based updates

1) For each spanning tree $T \in \mathrm{supp}(\rho)$, initialize $\theta^0(T)$ via

\[
\theta^0_s(T) = \bar\theta_s \quad \forall\, s \in V, \qquad
\theta^0_{st}(T) = \frac{1}{\rho_{st}}\,\bar\theta_{st} \quad \forall\, (s,t) \in E(T), \qquad
\theta^0_{st}(T) = 0 \quad \forall\, (s,t) \in E \setminus E(T).
\]

2) For iterations $n = 0, 1, 2, \ldots$, do the following:

(a) For each tree $T \in \mathrm{supp}(\rho)$, apply the ordinary max-product algorithm to compute the max-marginals $\nu^n(T)$ corresponding to the tree-structured distribution $p(x; \theta^n(T))$.

(b) Check if the tree distributions share a common optimizing configuration (i.e., if $\cap_T \mathrm{OPT}(\theta^n(T))$ is non-empty).

(i) If yes, output any shared configuration and terminate.

(ii) If not, check to see whether or not the following agreement condition holds:

\[
\nu^n_s(T) = \nu^n_s(T') \quad \forall\, s \in V,\ \forall\, T, T' \in \mathrm{supp}(\rho), \tag{64a}
\]
\[
\nu^n_{st}(T) = \nu^n_{st}(T') \quad \forall\, T, T' \in \mathrm{supp}(\rho) \ \text{s.t.}\ (s,t) \in E(T) \cap E(T'). \tag{64b}
\]

If this agreement of all max-marginals holds, then terminate. Otherwise, form a new exponential parameter $\tilde\theta$ as follows:

\[
\tilde\theta_s = \sum_T \rho(T) \log \nu^n_s(T) \quad \forall\, s \in V, \tag{65a}
\]
\[
\tilde\theta_{st} = \sum_{T \ni (s,t)} \rho(T) \log \frac{\nu^n_{st}(T)}{\nu^n_s(T)\,\nu^n_t(T)} \quad \forall\, (s,t) \in E. \tag{65b}
\]

Define $\theta^{n+1}(T)$ on each tree $T \in \mathrm{supp}(\rho)$ as in Step 1 with $\bar\theta \equiv \tilde\theta$, and proceed to Step 2(a).

Fig. 7: Tree-based updates for finding a tree-consistent set of pseudo-max-marginals.

Termination: Observe that there are two possible ways in which Algorithm 3 can terminate. On one hand, the algorithm stops if in Step 2(b)(i), a collection of tree-structured distributions is found that all share a common optimizing configuration. Herein lies the possibility of finite termination, since there is no need to wait until the values of the tree max-marginals $\nu^n(T)$ all agree for every tree. Otherwise, the algorithm terminates in Step 2(b)(ii) if the max-marginals for each tree all agree with one another, as stipulated by equation (64).
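
The check in Step 2(b)(i) can be illustrated on a toy problem. The sketch below assumes a binary triangle graph with a uniform distribution over its three spanning trees (so every edge has appearance probability 2/3), uses the Step 1 initialization, and replaces max-product by brute-force enumeration. All names and potentials are hypothetical. It compares the intersection of the per-tree optimal sets with the true MAP set, along with the convex-combination upper bound.

```python
import itertools

V = (0, 1, 2)
E = ((0, 1), (1, 2), (0, 2))
TREES = (((0, 1), (1, 2)), ((1, 2), (0, 2)), ((0, 1), (0, 2)))  # spanning trees of the triangle
RHO = (1 / 3, 1 / 3, 1 / 3)   # assumed uniform distribution over trees
RHO_EDGE = 2 / 3              # each edge lies in two of the three trees
CONFIGS = list(itertools.product((0, 1), repeat=3))

def score(node, edge, edges, scale, x):
    """Node terms plus (scaled) edge terms restricted to the given edge set."""
    return (sum(node[s][x[s]] for s in V)
            + sum(scale * edge[e][x[e[0]]][x[e[1]]] for e in edges))

def opt_set(node, edge, edges, scale):
    """Brute-force set of maximizers and the optimal value."""
    vals = {x: score(node, edge, edges, scale, x) for x in CONFIGS}
    best = max(vals.values())
    return {x for x, v in vals.items() if best - v < 1e-9}, best

def analyze(node, edge):
    """Shared tree optima, upper bound, and the exact MAP set and value."""
    per_tree = [opt_set(node, edge, T, 1 / RHO_EDGE) for T in TREES]
    shared = set.intersection(*(o for o, _ in per_tree))
    bound = sum(r * b for r, (_, b) in zip(RHO, per_tree))
    map_set, map_val = opt_set(node, edge, E, 1.0)
    return shared, bound, map_set, map_val

# Attractive couplings: the trees agree, so Step 2(b)(i) would fire at n = 0.
ferro = analyze([[0.0, 1.0]] * 3, {e: [[0.5, 0.0], [0.0, 0.5]] for e in E})
# Frustrated couplings: no shared optimum, and the bound is strictly loose.
frus = analyze([[0.0, 0.0]] * 3, {e: [[0.0, 1.0], [1.0, 0.0]] for e in E})
print(ferro)
print(frus)
```

In the attractive case the shared configuration coincides with the exact MAP configuration and the bound is tight. In the frustrated case the intersection is empty, so the algorithm would have to proceed to the agreement check in Step 2(b)(ii).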

A key property of the updates in Algorithm 3 is that they satisfy the ρ-reparameterization condition:

Lemma 11: For any iteration $n$, the tree-structured parameters $\{\theta^n(T)\}$ of Algorithm 3 satisfy $\sum_T \rho(T)\,\theta^n(T) = \bar\theta$.

Proof: The claim for $n = 0$ follows directly from the initialization in Step 1. In particular, we clearly have $\sum_T \rho(T)\,\theta^0_s(T) = \bar\theta_s$ for any node $s \in V$. For any edge $(s,t) \in E$, we compute:

\[
\sum_T \rho(T)\,\theta^0_{st}(T) = \sum_{T \ni (s,t)} \rho(T) \Big[ \frac{1}{\rho_{st}}\,\bar\theta_{st} \Big] = \bar\theta_{st},
\]

where the last step uses $\rho_{st} = \sum_{T \ni (s,t)} \rho(T)$.

To establish the claim for $n+1$, we proceed by induction. By the claim just proved, it suffices to show that $\tilde\theta$, as defined in equation (65), defines the same distribution as $\bar\theta$.

We begin by writing $\langle \tilde\theta, \phi(x) \rangle$ as

\[
\sum_{s \in V} \tilde\theta_s(x_s) + \sum_{(s,t) \in E} \tilde\theta_{st}(x_s, x_t).
\]

Using the definition (65), we can re-express it as follows:

\[
\sum_T \rho(T) \Big\{ \sum_{s \in V} \log \nu^n_s(T)(x_s) + \sum_{(s,t) \in E(T)} \log \frac{\nu^n_{st}(T)(x_s, x_t)}{\nu^n_s(T)(x_s)\,\nu^n_t(T)(x_t)} \Big\}. \tag{66}
\]

Recall that for each tree $T$, the quantities $\nu^n(T)$ are the max-marginals associated with the distribution $p(x; \theta^n(T))$. Using the fact (36) that the max-marginals specify a reparameterization of $p(x; \theta^n(T))$, each term within curly braces is simply equal (up to an additive constant independent of $x$) to $\langle \theta^n(T), \phi(x) \rangle$. Therefore, by the induction hypothesis, the RHS of equation (66) is equal to $\langle \bar\theta, \phi(x) \rangle$, so that the claim of the lemma follows. ■
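
The reparameterization property in Lemma 11 can be checked numerically on a small example. The sketch below assumes a binary triangle graph with a uniform distribution over its three spanning trees, replaces max-product with brute-force enumeration, and writes the update (65) in the log domain. The potentials and the helper names (`initialize`, `log_max_marginals`, `update`) are hypothetical.

```python
import itertools

V = (0, 1, 2)
E = ((0, 1), (1, 2), (0, 2))
TREES = (((0, 1), (1, 2)), ((1, 2), (0, 2)), ((0, 1), (0, 2)))
RHO = (1 / 3, 1 / 3, 1 / 3)   # assumed uniform distribution over spanning trees
RHO_EDGE = {e: sum(r for T, r in zip(TREES, RHO) if e in T) for e in E}
CONFIGS = list(itertools.product((0, 1), repeat=3))

def initialize(node, edge):
    """Step 1: theta_s(T) = theta_s and theta_st(T) = theta_st / rho_st on tree edges only."""
    return [(node, {e: [[v / RHO_EDGE[e] for v in row] for row in edge[e]] for e in T})
            for T in TREES]

def score(node, edge, x):
    """<theta, phi(x)> for node tables and a dict of edge tables."""
    return (sum(node[s][x[s]] for s in V)
            + sum(tab[x[s]][x[t]] for (s, t), tab in edge.items()))

def log_max_marginals(node, edge):
    """Brute-force log max-marginals of one tree problem (stand-in for max-product)."""
    def best(fixed):
        return max(score(node, edge, x) for x in CONFIGS
                   if all(x[s] == a for s, a in fixed))
    ns = {s: [best([(s, a)]) for a in (0, 1)] for s in V}
    es = {(s, t): [[best([(s, a), (t, b)]) for b in (0, 1)] for a in (0, 1)]
          for (s, t) in edge}
    return ns, es

def update(tree_params):
    """Equations (65a)-(65b), written in the log domain."""
    mm = [log_max_marginals(n, e) for n, e in tree_params]
    node = [[sum(r * ns[s][a] for r, (ns, _) in zip(RHO, mm)) for a in (0, 1)] for s in V]
    edge = {(s, t): [[sum(r * (es[(s, t)][a][b] - ns[s][a] - ns[t][b])
                          for r, (ns, es) in zip(RHO, mm) if (s, t) in es)
                      for b in (0, 1)] for a in (0, 1)]
            for (s, t) in E}
    return node, edge

# Hypothetical target parameter with mixed couplings.
node_bar = [[0.0, 0.3], [0.2, 0.0], [0.0, 0.1]]
edge_bar = {(0, 1): [[1.0, 0.0], [0.0, 1.0]],
            (1, 2): [[0.0, 1.0], [1.0, 0.0]],
            (0, 2): [[0.5, 0.0], [0.0, 0.5]]}

trees0 = initialize(node_bar, edge_bar)
# Base case: the rho-weighted edge parameters recover the target edge parameters.
base_err = max(abs(sum(r * p[1][e][a][b] for r, p in zip(RHO, trees0) if e in p[1])
                   - edge_bar[e][a][b])
               for e in E for a in (0, 1) for b in (0, 1))
# Inductive step: the updated parameter scores every configuration like the target.
node_new, edge_new = update(trees0)
diffs = [score(node_new, edge_new, x) - score(node_bar, edge_bar, x) for x in CONFIGS]
spread = max(diffs) - min(diffs)
print(base_err, spread)
```

A spread of zero (up to rounding) across all eight configurations indicates that the updated parameter and the target parameter define the same distribution, which is the content of the inductive step.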



On the basis of Lemma 11, it is straightforward to prove the analog of Theorem 2(a) for Algorithm 3. More specifically, whenever it outputs a configuration, it must be an exact MAP configuration for the original problem $p(x; \bar\theta)$.
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D. Proof of Lemma ^

We first prove an intermediate result:

Lemma 12: Let $\nu^*_s$ and $\nu^*_{st}$ represent pseudo-max-marginals, as defined in equation (48), with the message $M^* := \exp(\omega^*)$, where the exponential is taken element-wise. Then the Lagrangian $\mathcal{L}_{\bar\theta;\rho}(\tau, \omega^*)$ can be written as

\[
\sum_T \rho(T)\,\mathcal{G}(\tau, \omega^*; T) + \sum_{s \in V} \kappa_s \sum_{x_s} \tau_s(x_s) + \sum_{(s,t) \in E} \kappa_{st} \sum_{x_s, x_t} \tau_{st}(x_s, x_t),
\]

where $\kappa_s$ and $\kappa_{st}$ are constants, and $\mathcal{G}(\tau, \omega^*; T)$ is given by

\[
\sum_{s \in V} \sum_{x_s} \tau_s(x_s) \log \nu^*_s(x_s) + \sum_{(s,t) \in E(T)} \sum_{(x_s, x_t)} \tau_{st}(x_s, x_t) \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)\,\nu^*_t(x_t)}.
\]

Proof: Straightforward algebraic manipulation allows us to re-express the Lagrangian $\mathcal{L}_{\bar\theta;\rho}(\tau, \lambda)$ as

\[
\sum_T \rho(T) \Big\{ \sum_{s \in V} \sum_{x_s} \tau_s(x_s) \Big[ \bar\theta_s(x_s) + \sum_{t \in \Gamma(s)} \rho_{st}\,\lambda_{ts}(x_s) \Big]
+ \sum_{(s,t) \in E(T)} \sum_{x_s, x_t} \tau_{st}(x_s, x_t) \Big[ \frac{\bar\theta_{st}(x_s, x_t)}{\rho_{st}} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) \Big] \Big\}.
\]

Using the definition of $\nu^*$ in terms of $M^* = \exp(\omega^*)$, we can then write $\mathcal{L}_{\bar\theta;\rho}(\tau, \omega^*)$ in terms of $\mathcal{G}(\tau, \omega^*; T)$ as

\[
\sum_T \rho(T)\,\mathcal{G}(\tau, \omega^*; T) + \sum_{s \in V} \kappa_s \sum_{x_s} \tau_s(x_s) + \sum_{(s,t) \in E} \kappa_{st} \sum_{x_s, x_t} \tau_{st}(x_s, x_t),
\]

where the constants $\kappa_s$ and $\kappa_{st}$ arise from the normalization of $\nu^*_s$ and $\nu^*_{st}$. ■

Recall that we root all trees $T$ at a fixed vertex $r \in V$, so that each vertex $s \ne r$ has a unique parent denoted $\pi^T(s)$. Using the parent-to-child representation of the tree $T$, we can re-express $\mathcal{G}(\tau, \omega^*; T)$ as

\[
\Big\{ \sum_{x_r} \tau_r(x_r) \log \nu^*_r(x_r) + \sum_{(s, \pi^T(s))} \sum_{(x_s, x_{\pi^T(s)})} \tau_{st}(x_s, x_t) \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_t(x_t)} \Big\}
+ \sum_{s \ne r} \sum_{x_s} \log \nu^*_s(x_s) \Big[ \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t) \Big],
\]

where $t = \pi^T(s)$ in each term.

Recalling the definition of $\mathcal{F}_{\bar\theta;\rho}(\tau, \nu^*; T)$ from equation (55), we can write $\sum_T \rho(T)\,\mathcal{G}(\tau, \omega^*; T)$ as

\[
\sum_T \rho(T)\,\mathcal{F}_{\bar\theta;\rho}(\tau, \nu^*; T) + \sum_T \rho(T) \sum_{s \ne r} \sum_{x_s} \log \nu^*_s(x_s) \Big[ \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t) \Big],
\]

or equivalently as

\[
\sum_T \rho(T)\,\mathcal{F}_{\bar\theta;\rho}(\tau, \nu^*; T) + \sum_{(s,t) \in E} \sum_{x_s} \Big( \sum_{\{T \,\mid\, t = \pi^T(s)\}} \rho(T) \Big) \log \nu^*_s(x_s)\, C_{ts}(x_s),
\]

where $C_{ts}(x_s) = \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t)$ is the constraint. Note that for each fixed $(s,t)$ and $x_s$, the term $\sum_{\{T \,\mid\, t = \pi^T(s)\}} \rho(T) \log \nu^*_s(x_s)$ can be interpreted as a contribution to the Lagrange multiplier associated with the constraint $C_{ts}(x_s)$.

Finally, since the Lagrangian is linear in the Lagrange multipliers, we can use Lemma 12 to express the Lagrangian $\mathcal{L}_{\bar\theta;\rho}(\tau, \lambda^*)$ as

\[
\sum_T \rho(T)\,\mathcal{F}_{\bar\theta;\rho}(\tau, \nu^*; T) + \sum_{s \in V} \kappa_s \sum_{x_s} \tau_s(x_s) + \sum_{(s,t) \in E} \kappa_{st} \sum_{x_s, x_t} \tau_{st}(x_s, x_t),
\]

where $\lambda^*$ is the vector of Lagrange multipliers with components

\[
\lambda^*_{ts}(x_s) := \omega^*_{ts}(x_s) - \sum_{\{T \,\mid\, \pi^T(s) = t\}} \rho(T) \log \nu^*_s(x_s).
\]

E. Proof of Lemma M

Since the pseudo-max-marginals $\nu^*$ are defined by a fixed point $M^* = \exp(\lambda^*)$ (with the exponential defined element-wise) of the update equation (50), the pseudo-max-marginals must be pairwise-consistent. More explicitly, for any edge $(s,t)$ and $x_s \in \mathcal{X}_s$, the pairwise consistency condition $\max_{x_t} \nu^*_{st}(x_s, x_t) = \kappa_{st}\,\nu^*_s(x_s)$ holds, where $\kappa_{st}$ is a positive constant independent of $x_s$. Using this fact, we can write

\[
\max_{x_s, x_t} \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)} = \max_{x_s} \log \frac{\max_{x_t} \nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)} = \log \kappa_{st}. \tag{67}
\]
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Moreover, since by assumption the pseudo-max-marginals $\nu^*$ satisfy the optimum specification criterion, we can find a configuration $x^*$ that satisfies the local optimality conditions (38) for every node and edge on the full graph $G$. For this configuration, we have the equality

\[
\max_{x_s} \log \nu^*_s(x_s) = \log \nu^*_s(x^*_s) \quad \text{for all } s \in V, \tag{68}
\]

and

\[
\log \frac{\nu^*_{st}(x^*_s, x^*_t)}{\nu^*_s(x^*_s)} = \log \frac{\max_{x_t} \nu^*_{st}(x^*_s, x_t)}{\nu^*_s(x^*_s)} = \log \kappa_{st} = \max_{x_s, x_t} \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)} \tag{69}
\]

for all $(s,t) \in E$, where the final equality follows from equation (67).

Recall the definition of $\tau^*$ from equation (57a). Using equations (68) and (69), we have

\[
\sum_{x_s} \tau_s(x_s) \log \nu^*_s(x_s) \le \sum_{x_s} \tau^*_s(x_s) \log \nu^*_s(x_s) \tag{70}
\]

for all $s \in V$, and

\[
\sum_{x_s, x_t} \tau_{st}(x_s, x_t) \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)} \le \sum_{x_s, x_t} \tau^*_{st}(x_s, x_t) \log \frac{\nu^*_{st}(x_s, x_t)}{\nu^*_s(x_s)} \tag{71}
\]

for all $(s,t) \in E$. Both inequalities (70) and (71) hold for all $\tau \in \mathcal{S}$ (i.e., for all non-negative $\tau$ such that $\sum_{x_s} \tau_s(x_s) = 1$ and $\sum_{x_s, x_t} \tau_{st}(x_s, x_t) = 1$). Finally, inequalities (70) and (71) imply that $\max_{\tau \in \mathcal{S}} \mathcal{F}(\tau, \nu^*; T) = \mathcal{F}(\tau^*, \nu^*; T)$, as claimed.
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